[jira] [Commented] (NUTCH-2703) parse-tika: Boilerpipe should not run for non-(X)HTML pages

2019-04-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815285#comment-16815285
 ] 

ASF GitHub Bot commented on NUTCH-2703:
---

sebastian-nagel commented on pull request #449: NUTCH-2703 parse-tika: 
Boilerpipe should not run for non-(X)HTML pages
URL: https://github.com/apache/nutch/pull/449
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> parse-tika: Boilerpipe should not run for non-(X)HTML pages
> ---
>
> Key: NUTCH-2703
> URL: https://issues.apache.org/jira/browse/NUTCH-2703
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.15
>Reporter: Hany Shehata
>Priority: Critical
> Fix For: 1.16
>
> Attachments: NUTCH-2703.patch
>
>
> Boilerpipe runs for non-(X)HTML pages, which requires more resources.
> In my test scenario, my websites contain large PDFs, and with Boilerpipe 
> enabled I have to assign 8500MB of Java heap to finish the crawl job 
> without issues.
> Disabling Boilerpipe allows me to reduce the JVM heap to 500MB with no 
> issues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2703) parse-tika: Boilerpipe should not run for non-(X)HTML pages

2019-04-11 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815305#comment-16815305
 ] 

Markus Jelsma commented on NUTCH-2703:
--

remote: To git@github:apache/nutch.git
remote:bf75e96..7e6eabb  7e6eabbc2b0a0b5ee91148a9effc6447af5057ba -> master
remote: Syncing refs/heads/master...
remote: Sending notification emails to: ['"comm...@nutch.apache.org" 
']
To https://gitbox.apache.org/repos/asf/nutch.git
   bf75e962..7e6eabbc  master -> master

Thanks all!

> parse-tika: Boilerpipe should not run for non-(X)HTML pages
> ---
>
> Key: NUTCH-2703
> URL: https://issues.apache.org/jira/browse/NUTCH-2703
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.15
>Reporter: Hany Shehata
>Priority: Minor
> Fix For: 1.16
>
> Attachments: NUTCH-2703.patch
>
>
> Boilerpipe runs for non-(X)HTML pages, which requires more resources.
> In my test scenario, my websites contain large PDFs, and with Boilerpipe 
> enabled I have to assign 8500MB of Java heap to finish the crawl job 
> without issues.
> Disabling Boilerpipe allows me to reduce the JVM heap to 500MB with no 
> issues.





[jira] [Resolved] (NUTCH-2703) parse-tika: Boilerpipe should not run for non-(X)HTML pages

2019-04-11 Thread Markus Jelsma (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-2703.
--
Resolution: Fixed
  Assignee: Markus Jelsma

> parse-tika: Boilerpipe should not run for non-(X)HTML pages
> ---
>
> Key: NUTCH-2703
> URL: https://issues.apache.org/jira/browse/NUTCH-2703
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.15
>Reporter: Hany Shehata
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.16
>
> Attachments: NUTCH-2703.patch
>
>
> Boilerpipe runs for non-(X)HTML pages, which requires more resources.
> In my test scenario, my websites contain large PDFs, and with Boilerpipe 
> enabled I have to assign 8500MB of Java heap to finish the crawl job 
> without issues.
> Disabling Boilerpipe allows me to reduce the JVM heap to 500MB with no 
> issues.





[jira] [Commented] (NUTCH-2704) Upgrade crawler-commons dependency to 1.0

2019-04-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815274#comment-16815274
 ] 

ASF GitHub Bot commented on NUTCH-2704:
---

sebastian-nagel commented on pull request #448: NUTCH-2704 Upgrade 
crawler-commons dependency to 1.0
URL: https://github.com/apache/nutch/pull/448
 
 
   
 



> Upgrade crawler-commons dependency to 1.0
> -
>
> Key: NUTCH-2704
> URL: https://issues.apache.org/jira/browse/NUTCH-2704
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> [Crawler-commons 
> 1.0|https://github.com/crawler-commons/crawler-commons/#21st-march-2018crawler-commons-10-released]
>  has been released. We should upgrade.





Build failed in Jenkins: Nutch-trunk #3620

2019-04-11 Thread Apache Jenkins Server
See 


Changes:

[markus] NUTCH-2703 parse-tika: Boilerpipe should not run for non-(X)HTML pages

--
[...truncated 5.65 KB...]
[javac] Compiling 298 source files to 

[javac] 
:32:
 error: cannot find symbol
[javac] import com.j256.ormlite.field.ForeignCollectionField;
[javac]  ^
[javac]   symbol:   class ForeignCollectionField
[javac]   location: package com.j256.ormlite.field
[javac] 
:29:
 error: cannot find symbol
[javac] import com.j256.ormlite.field.DatabaseField;
[javac]  ^
[javac]   symbol:   class DatabaseField
[javac]   location: package com.j256.ormlite.field
[javac] 
:28:
 error: cannot find symbol
[javac] import com.j256.ormlite.field.DatabaseField;
[javac]  ^
[javac]   symbol:   class DatabaseField
[javac]   location: package com.j256.ormlite.field
[javac] 
:24:
 error: cannot find symbol
[javac] import com.j256.ormlite.dao.Dao;
[javac]^
[javac]   symbol:   class Dao
[javac]   location: package com.j256.ormlite.dao
[javac] 
:26:
 error: cannot find symbol
[javac] import com.j256.ormlite.support.ConnectionSource;
[javac]^
[javac]   symbol:   class ConnectionSource
[javac]   location: package com.j256.ormlite.support
[javac] 
:29:
 error: cannot find symbol
[javac]   private ConnectionSource connectionSource;
[javac]   ^
[javac]   symbol:   class ConnectionSource
[javac]   location: class CustomDaoFactory
[javac] 
:30:
 error: cannot find symbol
[javac]   private List> registredDaos = Collections
[javac]^
[javac]   symbol:   class Dao
[javac]   location: class CustomDaoFactory
[javac] 
:33:
 error: cannot find symbol
[javac]   public CustomDaoFactory(ConnectionSource connectionSource) {
[javac]   ^
[javac]   symbol:   class ConnectionSource
[javac]   location: class CustomDaoFactory
[javac] 
:37:
 error: cannot find symbol
[javac]   public  Dao createDao(Class clazz) {
[javac]  ^
[javac]   symbol:   class Dao
[javac]   location: class CustomDaoFactory
[javac] 
:47:
 error: cannot find symbol
[javac]   private  void register(Dao dao) {
[javac] ^
[javac]   symbol:   class Dao
[javac]   location: class CustomDaoFactory
[javac] 
:53:
 error: cannot find symbol
[javac]   public List> getCreatedDaos() {
[javac]   ^
[javac]   symbol:   class Dao
[javac]   location: class CustomDaoFactory
[javac] 
:22:
 error: cannot find symbol
[javac] import com.j256.ormlite.dao.BaseDaoImpl;
[javac]^
[javac]   symbol:   class BaseDaoImpl
[javac]   location: package com.j256.ormlite.dao
[javac] 
:23:
 error: cannot find symbol
[javac] import com.j256.ormlite.dao.Dao;
[javac]^
[javac]   symbol:   class Dao
[javac]   location: package com.j256.ormlite.dao
[javac] 
:25:
 error: cannot find symbol
[javac] import com.j256.ormlite.table.DatabaseTableConfig;
[javac]  ^
[javac]   symbol:   class 

[jira] [Commented] (NUTCH-2703) parse-tika: Boilerpipe should not run for non-(X)HTML pages

2019-04-11 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815308#comment-16815308
 ] 

Hudson commented on NUTCH-2703:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3620 (See 
[https://builds.apache.org/job/Nutch-trunk/3620/])
NUTCH-2703 parse-tika: Boilerpipe should not run for non-(X)HTML pages (markus: 
[https://github.com/apache/nutch/commit/7e6eabbc2b0a0b5ee91148a9effc6447af5057ba])
* (edit) 
src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
* (edit) conf/nutch-default.xml


> parse-tika: Boilerpipe should not run for non-(X)HTML pages
> ---
>
> Key: NUTCH-2703
> URL: https://issues.apache.org/jira/browse/NUTCH-2703
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.15
>Reporter: Hany Shehata
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.16
>
> Attachments: NUTCH-2703.patch
>
>
> Boilerpipe runs for non-(X)HTML pages, which requires more resources.
> In my test scenario, my websites contain large PDFs, and with Boilerpipe 
> enabled I have to assign 8500MB of Java heap to finish the crawl job 
> without issues.
> Disabling Boilerpipe allows me to reduce the JVM heap to 500MB with no 
> issues.





[jira] [Commented] (NUTCH-2700) Indexchecker: improve command-line help

2019-04-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815310#comment-16815310
 ] 

ASF GitHub Bot commented on NUTCH-2700:
---

sebastian-nagel commented on pull request #446: NUTCH-2700 Indexchecker: 
improve command-line help
URL: https://github.com/apache/nutch/pull/446
 
 
   
 



> Indexchecker: improve command-line help
> ---
>
> Key: NUTCH-2700
> URL: https://issues.apache.org/jira/browse/NUTCH-2700
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> The command-line help of the indexchecker tool is incomplete:
> {noformat}
> Usage: IndexingFiltersChecker [-normalize] [-followRedirects] [-dumpText] 
> [-md key=value] (-stdin | -listen  [-keepClientCnxOpen])
> {noformat}
> It does not
> - show the possibility to pass the URL as argument
> - mention the property {{-DdoIndex=true}} which makes it send the document to 
> the indexes
> It should follow the help shown by parsechecker.





[jira] [Commented] (NUTCH-2690) Configurable and fast URL filter

2019-04-11 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815355#comment-16815355
 ] 

Sebastian Nagel commented on NUTCH-2690:


PR updated, squashed and rebased to current master.
I'll commit next week, but reviews are welcome. Thanks!

Below are the benchmark results from the unit tests. While the new plugin 
outperforms urlfilter-regex, the plugin urlfilter-automaton is still faster. 
However, the regular expressions supported by 
[dk.brics.automaton](https://www.brics.dk/automaton/) are less expressive, e.g. 
the "skip URLs with slash-delimited segment that repeats 3+ times" rule cannot 
be expressed because there are no back-references.
{noformat}
% ant test
...

% grep 'bench time' build/urlfilter-regex/test/*.txt | sed -E 's@api\..*( - )@\1... @'
2019-04-11 13:46:18,776 INFO  ... - bench time (50) 107ms
2019-04-11 13:46:18,845 INFO  ... - bench time (100) 66ms
2019-04-11 13:46:18,961 INFO  ... - bench time (200) 116ms
2019-04-11 13:46:19,192 INFO  ... - bench time (400) 231ms
2019-04-11 13:46:19,663 INFO  ... - bench time (800) 471ms

% grep 'bench time' build/urlfilter-fast/test/*.txt | sed -E 's@api\..*( - )@\1... @'
2019-04-11 13:48:05,024 INFO  ... - bench time (50) 72ms
2019-04-11 13:48:05,112 INFO  ... - bench time (100) 84ms
2019-04-11 13:48:05,233 INFO  ... - bench time (200) 121ms
2019-04-11 13:48:05,446 INFO  ... - bench time (400) 213ms
2019-04-11 13:48:05,687 INFO  ... - bench time (800) 241ms

% grep 'bench time' build/urlfilter-automaton/test/*.txt | sed -E 's@api\..*( - )@\1... @'
2019-04-11 13:43:11,794 INFO  ... - bench time (50) 43ms
2019-04-11 13:43:11,834 INFO  ... - bench time (100) 37ms
2019-04-11 13:43:11,899 INFO  ... - bench time (200) 65ms
2019-04-11 13:43:11,996 INFO  ... - bench time (400) 97ms
2019-04-11 13:43:12,175 INFO  ... - bench time (800) 178ms
{noformat}
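
To illustrate why back-references matter for the rule quoted above, here is a minimal sketch using java.util.regex. The class name and the exact pattern are assumptions for illustration (the pattern is modeled on the "repeats 3+ times" rule, not copied from any plugin's configuration). `\1` refers back to whatever the first group captured, which a pure finite automaton cannot express.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Nutch.
final class RepeatedSegmentDemo {
  // Group 1 captures one "/segment"; each \1 requires the same text again,
  // separated by one arbitrary segment. Assumed pattern, for illustration only.
  private static final Pattern REPEATED =
      Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

  /** Returns true if some slash-delimited segment occurs three times in this shape. */
  static boolean hasRepeatedSegment(String url) {
    return REPEATED.matcher(url).find();
  }
}
```

A URL such as `http://example.com/a/b/a/c/a/` would be flagged, while `http://example.com/a/b/c/d/` would not.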


> Configurable and fast URL filter
> 
>
> Key: NUTCH-2690
> URL: https://issues.apache.org/jira/browse/NUTCH-2690
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> This improvement introduces a new URL filter plugin "urlfilter-fast" (naming 
> debatable) which is in use at Common Crawl [since 
> 2013|https://github.com/commoncrawl/nutch/commit/968e0d8f292bed46e4e3eb276cb475f4403ea9bd]
>  to apply a long list of filters. 
> # an exact (suffix) match against the host name is done to retrieve 
> host/domain-specific regex rules
> # applies a regular expression against the path (and query) component of the 
> URL
> What makes it faster than urlfilter-regex for common cases:
> - regexes are selected by host name or its domain suffix, so there are 
> usually fewer rules to be checked. That's similar to NUTCH-1838 but any 
> domain suffix can be matched including {{subdomain.domain.com}}, {{com}} or 
> {{.}} for global rules. The selection by host name suffix is very fast.
> - regexes are applied only to the path component (optionally including the 
> query) and not the entire URL.
>   Matching against a shorter string can make a huge difference for more 
> complex regular expressions.
> - the rule to deny everything from a host or domain gets special treatment to 
> be fast
> More details about the rule format are found in the plugin's 
> [README|https://github.com/commoncrawl/nutch/blob/cc-fast-url-filter/src/plugin/urlfilter-fast/README.md].
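
The two steps described above (select rules by host name suffix, then match a regex against the path and query only) can be sketched roughly as follows. This is not the plugin's code: the class, the `addRule` method, the "first matching rule wins" behavior and the "accept when nothing matches" default are all assumptions made for the example. See the README linked above for the real rule format.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch; names and semantics are assumptions, not the plugin's API.
final class HostSuffixFilterSketch {
  /** One rule: accept or reject when the regex is found in the path. */
  private static final class Rule {
    final boolean accept;
    final Pattern pattern;

    Rule(boolean accept, String regex) {
      this.accept = accept;
      this.pattern = Pattern.compile(regex);
    }
  }

  // Rules keyed by host name or domain suffix; "." holds the global rules.
  private final Map<String, List<Rule>> rulesBySuffix = new HashMap<>();

  void addRule(String suffix, boolean accept, String regex) {
    rulesBySuffix.computeIfAbsent(suffix, k -> new ArrayList<>())
        .add(new Rule(accept, regex));
  }

  /** Returns true if the URL passes the filter (assumed default: accept). */
  boolean accept(String url) {
    URI uri = URI.create(url);
    String host = uri.getHost();
    if (host == null) {
      return false;
    }
    String path = uri.getRawPath() == null ? "" : uri.getRawPath();
    if (uri.getRawQuery() != null) {
      path = path + "?" + uri.getRawQuery();
    }
    // Step 1: exact map lookups for the host and each of its domain suffixes,
    // most specific first, ending with the global key ".".
    String suffix = host.toLowerCase(Locale.ROOT);
    while (true) {
      Boolean decision = apply(rulesBySuffix.get(suffix), path);
      if (decision != null) {
        return decision;
      }
      if (suffix.equals(".")) {
        break;
      }
      int dot = suffix.indexOf('.');
      suffix = dot < 0 ? "." : suffix.substring(dot + 1);
    }
    return true;
  }

  // Step 2: run only the selected regexes, and only against the path (+ query).
  private static Boolean apply(List<Rule> rules, String path) {
    if (rules == null) {
      return null;
    }
    for (Rule rule : rules) {
      if (rule.pattern.matcher(path).find()) {
        return rule.accept;
      }
    }
    return null;
  }
}
```

Because rule selection is a handful of hash lookups, the regex work is limited to the few rules registered for the matching suffix and to the shorter path string.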





[jira] [Comment Edited] (NUTCH-2690) Configurable and fast URL filter

2019-04-11 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815355#comment-16815355
 ] 

Sebastian Nagel edited comment on NUTCH-2690 at 4/11/19 11:53 AM:
--

PR updated, squashed and rebased to current master.
 I'll commit next week, but reviews are welcome. Thanks!

Below are the benchmark results from the unit tests. While the new plugin 
outperforms urlfilter-regex, the plugin urlfilter-automaton is still faster. 
However, the regular expressions supported by the 
[dk.brics.automaton|https://www.brics.dk/automaton/] library are less 
expressive, e.g. the "skip URLs with slash-delimited segment that repeats 3+ 
times" rule cannot be expressed because there are no back-references.
{noformat}
% ant test
...

% grep 'bench time' build/urlfilter-regex/test/*.txt | sed -E 's@api\..*( - )@\1... @'
2019-04-11 13:46:18,776 INFO  ... - bench time (50) 107ms
2019-04-11 13:46:18,845 INFO  ... - bench time (100) 66ms
2019-04-11 13:46:18,961 INFO  ... - bench time (200) 116ms
2019-04-11 13:46:19,192 INFO  ... - bench time (400) 231ms
2019-04-11 13:46:19,663 INFO  ... - bench time (800) 471ms

% grep 'bench time' build/urlfilter-fast/test/*.txt | sed -E 's@api\..*( - )@\1... @'
2019-04-11 13:48:05,024 INFO  ... - bench time (50) 72ms
2019-04-11 13:48:05,112 INFO  ... - bench time (100) 84ms
2019-04-11 13:48:05,233 INFO  ... - bench time (200) 121ms
2019-04-11 13:48:05,446 INFO  ... - bench time (400) 213ms
2019-04-11 13:48:05,687 INFO  ... - bench time (800) 241ms

% grep 'bench time' build/urlfilter-automaton/test/*.txt | sed -E 's@api\..*( - )@\1... @'
2019-04-11 13:43:11,794 INFO  ... - bench time (50) 43ms
2019-04-11 13:43:11,834 INFO  ... - bench time (100) 37ms
2019-04-11 13:43:11,899 INFO  ... - bench time (200) 65ms
2019-04-11 13:43:11,996 INFO  ... - bench time (400) 97ms
2019-04-11 13:43:12,175 INFO  ... - bench time (800) 178ms
{noformat}
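
For readers who want the relative numbers, the timings in the log above can be turned into ratios with a few lines of Java. The values are copied from the log. The class name is hypothetical, and reading the number in parentheses as a test size is an assumption.

```java
// Hypothetical helper for comparing the logged timings; not part of Nutch.
final class BenchRatios {
  // Timings in ms copied from the log above, for (50), (100), (200), (400), (800).
  static final int[] REGEX = {107, 66, 116, 231, 471};
  static final int[] FAST = {72, 84, 121, 213, 241};
  static final int[] AUTOMATON = {43, 37, 65, 97, 178};

  /** Ratio baseline/candidate; values above 1.0 mean the candidate was faster. */
  static double speedup(int baselineMs, int candidateMs) {
    return (double) baselineMs / candidateMs;
  }

  public static void main(String[] args) {
    int[] sizes = {50, 100, 200, 400, 800};
    for (int i = 0; i < sizes.length; i++) {
      System.out.printf("(%d): fast %.2fx, automaton %.2fx relative to regex%n",
          sizes[i], speedup(REGEX[i], FAST[i]), speedup(REGEX[i], AUTOMATON[i]));
    }
  }
}
```

On these single runs urlfilter-fast is roughly twice as fast as urlfilter-regex at (800), but slightly slower at (100) and (200), so the comparison is best read as indicative rather than precise.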


> Configurable and fast URL filter
> 
>
> Key: NUTCH-2690
> URL: https://issues.apache.org/jira/browse/NUTCH-2690
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> This improvement introduces a new URL filter plugin "urlfilter-fast" (naming 
> debatable) which is in use at Common Crawl [since 
> 2013|https://github.com/commoncrawl/nutch/commit/968e0d8f292bed46e4e3eb276cb475f4403ea9bd]
>  to apply a long list of filters. 
> # an exact (suffix) match against the host name is done to retrieve 
> host/domain-specific regex rules
> # applies a regular expression against the path (and query) component of the 
> URL
> What makes it faster than urlfilter-regex for common cases:
> - regexes are selected by host name or its domain suffix, so there are 
> usually fewer rules to be checked. That's similar to NUTCH-1838 but any 
> domain suffix can be matched including {{subdomain.domain.com}}, {{com}} or 
> {{.}} for global rules. The selection by host name suffix is very fast.
> - regexes are applied only to the path component 

[jira] [Assigned] (NUTCH-2279) LinkRank fails when using Hadoop MR output compression

2019-04-11 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2279:
--

Assignee: Sebastian Nagel

> LinkRank fails when using Hadoop MR output compression
> --
>
> Key: NUTCH-2279
> URL: https://issues.apache.org/jira/browse/NUTCH-2279
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.12
>Reporter: Joseph Naegele
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.16
>
>
> When using MapReduce job output compression, i.e. 
> {{mapreduce.output.fileoutputformat.compress=true}}, LinkRank can't read the 
> results of its {{Counter}} MR job due to the additional, generated file 
> extension.
> For example, using the default compression codec (which appears to be 
> DEFLATE), the counter file is written to 
> {{crawl/webgraph/_num_nodes_/part-0.deflate}}. Then, the LinkRank job 
> attempts to manually read this file to obtain the number of links using the 
> following code:
> {code}
> FSDataInputStream readLinks = fs.open(new Path(numLinksPath, "part-0"));
> {code}
> which fails because the file {{part-0}} doesn't exist:
> {code}
> LinkAnalysis: java.io.FileNotFoundException: File 
> crawl/webgraph/_num_nodes_/part-0 does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:819)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:596)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
> at 
> org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:124)
> at 
> org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:633)
> at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:713)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:680)
> {code}
> To reproduce, add {{-D mapreduce.output.fileoutputformat.compress=true}} to 
> the properties for {{bin/nutch linkrank ...}}
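
One way to avoid depending on the exact file name is to look up whatever `part-*` file exists and pick the decompression from its extension. The sketch below shows that idea with the JDK only. It is not the LinkRank fix: the class and method names are hypothetical, and it assumes a `.deflate` file is a plain zlib stream. Inside Hadoop code, the more natural route would likely be Hadoop's own codec lookup (for example `CompressionCodecFactory`) instead of the hand-rolled extension check.

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.InflaterInputStream;

// Hypothetical sketch using local files; not the actual LinkRank code.
final class PartFileReaderSketch {
  /** Reads the first line of the first "part-*" file in dir, inflating ".deflate" files. */
  static String readFirstLine(Path dir) throws IOException {
    try (DirectoryStream<Path> parts = Files.newDirectoryStream(dir, "part-*")) {
      for (Path part : parts) {
        InputStream in = Files.newInputStream(part);
        if (part.getFileName().toString().endsWith(".deflate")) {
          // Assumption: the file is a zlib stream, which InflaterInputStream reads.
          in = new InflaterInputStream(in);
        }
        try (BufferedReader reader =
            new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
          return reader.readLine();
        }
      }
    }
    throw new FileNotFoundException("no part-* file in " + dir);
  }
}
```

The glob matches both `part-0` and `part-0.deflate`, so the same call works with and without output compression.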





[jira] [Created] (NUTCH-2708) urlfilter-automaton: update library dependency (dk.brics.automaton)

2019-04-11 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2708:
--

 Summary: urlfilter-automaton: update library dependency 
(dk.brics.automaton)
 Key: NUTCH-2708
 URL: https://issues.apache.org/jira/browse/NUTCH-2708
 Project: Nutch
  Issue Type: Improvement
  Components: urlfilter, plugin
Affects Versions: 1.15
Reporter: Sebastian Nagel
 Fix For: 1.16


A new version of the [dk.brics.automaton|https://www.brics.dk/automaton/] 
library (1.12-1) is available.





[jira] [Commented] (NUTCH-2703) parse-tika: Boilerpipe should not run for non-(X)HTML pages

2019-04-11 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815282#comment-16815282
 ] 

Sebastian Nagel commented on NUTCH-2703:


+1

But I would opt to make it configurable. I'll open a PR for that.

> parse-tika: Boilerpipe should not run for non-(X)HTML pages
> ---
>
> Key: NUTCH-2703
> URL: https://issues.apache.org/jira/browse/NUTCH-2703
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.15
>Reporter: Hany Shehata
>Priority: Critical
> Fix For: 1.16
>
> Attachments: NUTCH-2703.patch
>
>
> Boilerpipe runs for non-(X)HTML pages, which requires more resources.
> In my test scenario, my websites contain large PDFs, and with Boilerpipe 
> enabled I have to assign 8500MB of Java heap to finish the crawl job 
> without issues.
> Disabling Boilerpipe allows me to reduce the JVM heap to 500MB with no 
> issues.





[jira] [Work started] (NUTCH-2700) Indexchecker: improve command-line help

2019-04-11 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2700 started by Sebastian Nagel.
--
> Indexchecker: improve command-line help
> ---
>
> Key: NUTCH-2700
> URL: https://issues.apache.org/jira/browse/NUTCH-2700
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> The command-line help of the indexchecker tool is incomplete:
> {noformat}
> Usage: IndexingFiltersChecker [-normalize] [-followRedirects] [-dumpText] 
> [-md key=value] (-stdin | -listen  [-keepClientCnxOpen])
> {noformat}
> It does not
> - show the possibility to pass the URL as argument
> - mention the property {{-DdoIndex=true}} which makes it send the document to 
> the indexes
> It should follow the help shown by parsechecker.





[jira] [Assigned] (NUTCH-2700) Indexchecker: improve command-line help

2019-04-11 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2700:
--

Assignee: Sebastian Nagel

> Indexchecker: improve command-line help
> ---
>
> Key: NUTCH-2700
> URL: https://issues.apache.org/jira/browse/NUTCH-2700
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> The command-line help of the indexchecker tool is incomplete:
> {noformat}
> Usage: IndexingFiltersChecker [-normalize] [-followRedirects] [-dumpText] 
> [-md key=value] (-stdin | -listen  [-keepClientCnxOpen])
> {noformat}
> It does not
> - show the possibility to pass the URL as argument
> - mention the property {{-DdoIndex=true}} which makes it send the document to 
> the indexes
> It should follow the help shown by parsechecker.





[jira] [Resolved] (NUTCH-2700) Indexchecker: improve command-line help

2019-04-11 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2700.

Resolution: Implemented

> Indexchecker: improve command-line help
> ---
>
> Key: NUTCH-2700
> URL: https://issues.apache.org/jira/browse/NUTCH-2700
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> The command-line help of the indexchecker tool is incomplete:
> {noformat}
> Usage: IndexingFiltersChecker [-normalize] [-followRedirects] [-dumpText] 
> [-md key=value] (-stdin | -listen  [-keepClientCnxOpen])
> {noformat}
> It does not
> - show the possibility to pass the URL as argument
> - mention the property {{-DdoIndex=true}} which makes it send the document to 
> the indexes
> It should follow the help shown by parsechecker.





[jira] [Commented] (NUTCH-2708) urlfilter-automaton: update library dependency (dk.brics.automaton)

2019-04-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815360#comment-16815360
 ] 

ASF GitHub Bot commented on NUTCH-2708:
---

sebastian-nagel commented on pull request #450: NUTCH-2708 urlfilter-automaton: 
update library dependency (dk.brics.automaton)
URL: https://github.com/apache/nutch/pull/450
 
 
   
 



> urlfilter-automaton: update library dependency (dk.brics.automaton)
> ---
>
> Key: NUTCH-2708
> URL: https://issues.apache.org/jira/browse/NUTCH-2708
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, urlfilter
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.16
>
>
> A new version of the [dk.brics.automaton|https://www.brics.dk/automaton/] 
> library (1.12-1) is available.





[jira] [Updated] (NUTCH-2703) parse-tika: Boilerpipe should not run for non-(X)HTML pages

2019-04-11 Thread Markus Jelsma (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2703:
-
Priority: Minor  (was: Critical)

> parse-tika: Boilerpipe should not run for non-(X)HTML pages
> ---
>
> Key: NUTCH-2703
> URL: https://issues.apache.org/jira/browse/NUTCH-2703
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.15
>Reporter: Hany Shehata
>Priority: Minor
> Fix For: 1.16
>
> Attachments: NUTCH-2703.patch
>
>
> Boilerpipe runs for non-(X)HTML pages, which requires more resources.
> In my test scenario, my websites contain large PDFs, and with Boilerpipe 
> enabled I have to assign 8500MB of Java heap to finish the crawl job 
> without issues.
> Disabling Boilerpipe allows me to reduce the JVM heap to 500MB with no 
> issues.





[jira] [Commented] (NUTCH-2703) parse-tika: Boilerpipe should not run for non-(X)HTML pages

2019-04-11 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815302#comment-16815302
 ] 

Markus Jelsma commented on NUTCH-2703:
--

Thanks for covering both MIME types, text/html AND 
application/xhtml+xml. I'll get this committed!
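
The kind of check discussed here can be sketched as below. This is a hypothetical helper, not the code from the patch or from `TikaParser`. The class name, the method name and the handling of MIME parameters are assumptions. Only the two MIME types come from the comment above.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not the actual TikaParser code.
final class BoilerpipeGate {
  // The two (X)HTML MIME types named in the comment above.
  private static final Set<String> HTML_TYPES =
      Set.of("text/html", "application/xhtml+xml");

  /** Returns true only if Boilerpipe is enabled and the content is (X)HTML. */
  static boolean shouldRunBoilerpipe(boolean boilerpipeEnabled, String contentType) {
    if (!boilerpipeEnabled || contentType == null) {
      return false;
    }
    // Drop parameters such as "; charset=UTF-8" and normalize the case.
    int semicolon = contentType.indexOf(';');
    String base = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
        .trim()
        .toLowerCase(Locale.ROOT);
    return HTML_TYPES.contains(base);
  }
}
```

With such a gate, a large `application/pdf` document would skip Boilerpipe entirely, which is the memory saving the reporter describes.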



> parse-tika: Boilerpipe should not run for non-(X)HTML pages
> ---
>
> Key: NUTCH-2703
> URL: https://issues.apache.org/jira/browse/NUTCH-2703
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.15
>Reporter: Hany Shehata
>Priority: Minor
> Fix For: 1.16
>
> Attachments: NUTCH-2703.patch
>
>
> Boilerpipe runs for non-(X)HTML pages, which requires more resources.
> In my test scenario, my websites contain large PDFs, and with Boilerpipe 
> enabled I have to assign 8500MB of Java heap to finish the crawl job 
> without issues.
> Disabling Boilerpipe allows me to reduce the JVM heap to 500MB with no 
> issues.





Build failed in Jenkins: Nutch-trunk #3621

2019-04-11 Thread Apache Jenkins Server
See 


Changes:

[snagel] NUTCH-2700 Indexchecker: improve command-line help - add options

--
[...truncated 5.66 KB...]
[javac] Compiling 298 source files to 

[javac] :32: error: cannot find symbol
[javac] import com.j256.ormlite.field.ForeignCollectionField;
[javac]  ^
[javac]   symbol:   class ForeignCollectionField
[javac]   location: package com.j256.ormlite.field
[javac] :29: error: cannot find symbol
[javac] import com.j256.ormlite.field.DatabaseField;
[javac]  ^
[javac]   symbol:   class DatabaseField
[javac]   location: package com.j256.ormlite.field
[javac] :28: error: cannot find symbol
[javac] import com.j256.ormlite.field.DatabaseField;
[javac]  ^
[javac]   symbol:   class DatabaseField
[javac]   location: package com.j256.ormlite.field
[javac] :24: error: cannot find symbol
[javac] import com.j256.ormlite.dao.Dao;
[javac]^
[javac]   symbol:   class Dao
[javac]   location: package com.j256.ormlite.dao
[javac] :26: error: cannot find symbol
[javac] import com.j256.ormlite.support.ConnectionSource;
[javac]^
[javac]   symbol:   class ConnectionSource
[javac]   location: package com.j256.ormlite.support
[javac] :29: error: cannot find symbol
[javac]   private ConnectionSource connectionSource;
[javac]   ^
[javac]   symbol:   class ConnectionSource
[javac]   location: class CustomDaoFactory
[javac] :30: error: cannot find symbol
[javac]   private List> registredDaos = Collections
[javac]^
[javac]   symbol:   class Dao
[javac]   location: class CustomDaoFactory
[javac] :33: error: cannot find symbol
[javac]   public CustomDaoFactory(ConnectionSource connectionSource) {
[javac]   ^
[javac]   symbol:   class ConnectionSource
[javac]   location: class CustomDaoFactory
[javac] :37: error: cannot find symbol
[javac]   public  Dao createDao(Class clazz) {
[javac]  ^
[javac]   symbol:   class Dao
[javac]   location: class CustomDaoFactory
[javac] :47: error: cannot find symbol
[javac]   private  void register(Dao dao) {
[javac] ^
[javac]   symbol:   class Dao
[javac]   location: class CustomDaoFactory
[javac] :53: error: cannot find symbol
[javac]   public List> getCreatedDaos() {
[javac]   ^
[javac]   symbol:   class Dao
[javac]   location: class CustomDaoFactory
[javac] :22: error: cannot find symbol
[javac] import com.j256.ormlite.dao.BaseDaoImpl;
[javac]^
[javac]   symbol:   class BaseDaoImpl
[javac]   location: package com.j256.ormlite.dao
[javac] :23: error: cannot find symbol
[javac] import com.j256.ormlite.dao.Dao;
[javac]^
[javac]   symbol:   class Dao
[javac]   location: package com.j256.ormlite.dao
[javac] :25: error: cannot find symbol
[javac] import com.j256.ormlite.table.DatabaseTableConfig;
[javac]  ^
[javac]   symbol:   class 

[jira] [Commented] (NUTCH-2700) Indexchecker: improve command-line help

2019-04-11 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815350#comment-16815350
 ] 

Hudson commented on NUTCH-2700:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3621 (See 
[https://builds.apache.org/job/Nutch-trunk/3621/])
NUTCH-2700 Indexchecker: improve command-line help - add options (snagel: 
[https://github.com/apache/nutch/commit/76c8cff1402e217049942bac88a8a005d45abf43])
* (edit) src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* (edit) src/java/org/apache/nutch/parse/ParserChecker.java


> Indexchecker: improve command-line help
> ---
>
> Key: NUTCH-2700
> URL: https://issues.apache.org/jira/browse/NUTCH-2700
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> The command-line help of the indexchecker tool is incomplete:
> {noformat}
> Usage: IndexingFiltersChecker [-normalize] [-followRedirects] [-dumpText] 
> [-md key=value] (-stdin | -listen  [-keepClientCnxOpen])
> {noformat}
> It does not
> - show the possibility to pass the URL as argument
> - mention the property {{-DdoIndex=true}} which makes it send the document to 
> the indexes
> It should follow the help shown by parsechecker.





[jira] [Commented] (NUTCH-2703) parse-tika: Boilerpipe should not run for non-(X)HTML pages

2019-04-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815383#comment-16815383
 ] 

ASF GitHub Bot commented on NUTCH-2703:
---

sebastian-nagel commented on pull request #449: NUTCH-2703 parse-tika: 
Boilerpipe should not run for non-(X)HTML pages
URL: https://github.com/apache/nutch/pull/449



> parse-tika: Boilerpipe should not run for non-(X)HTML pages
> ---
>
> Key: NUTCH-2703
> URL: https://issues.apache.org/jira/browse/NUTCH-2703
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, plugin
>Affects Versions: 1.15
>Reporter: Hany Shehata
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.16
>
> Attachments: NUTCH-2703.patch
>
>
> Boilerpipe runs for non-(X)HTML pages, which requires more resources.
> In my test scenario, my websites contain large PDFs, and with Boilerpipe 
> enabled I have to assign 8500MB of Java heap to finish the crawl job 
> without issues.
> Disabling Boilerpipe allows me to reduce the JVM heap to 500MB with no 
> issues.


