[jira] [Created] (NUTCH-2403) Nutch Selenium: Wrong documentation about PhantomJS

2017-07-21 Thread Moreno Feltscher (JIRA)
Moreno Feltscher created NUTCH-2403:
---

 Summary: Nutch Selenium: Wrong documentation about PhantomJS
 Key: NUTCH-2403
 URL: https://issues.apache.org/jira/browse/NUTCH-2403
 Project: Nutch
  Issue Type: Bug
Reporter: Moreno Feltscher


The Nutch Selenium documentation states that PhantomJS can be used as 
{{phantomJS}} for {{selenium.driver}}. The correct value would be {{phantomjs}} 
according to 
https://github.com/apache/nutch/blob/master/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L124



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (NUTCH-2486) Compiler Warning: Unchecked / unsafe operations in MimeTypeIndexingFilter

2017-12-19 Thread Moreno Feltscher (JIRA)
Moreno Feltscher created NUTCH-2486:
---

 Summary: Compiler Warning: Unchecked / unsafe operations in 
MimeTypeIndexingFilter
 Key: NUTCH-2486
 URL: https://issues.apache.org/jira/browse/NUTCH-2486
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.14
Reporter: Moreno Feltscher


When compiling the Nutch source, the following warning is shown:
{quote}
Note: 
src/plugin/mimetype-filter/src/java/org/apache/nutch/indexer/filter/MimeTypeIndexingFilter.java
 uses unchecked or unsafe operations.
{quote}





[jira] [Created] (NUTCH-2473) Elasticsearch REST Indexer broken due to wrong dependency

2017-12-07 Thread Moreno Feltscher (JIRA)
Moreno Feltscher created NUTCH-2473:
---

 Summary: Elasticsearch REST Indexer broken due to wrong dependency
 Key: NUTCH-2473
 URL: https://issues.apache.org/jira/browse/NUTCH-2473
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.14
Reporter: Moreno Feltscher


When trying to index into Elasticsearch using {{indexer-elastic-rest}}, the 
following error is thrown:
{code}
Exception in thread "main" java.lang.LinkageError: loader constraint violation: when resolving method "org.slf4j.impl.StaticLoggerBinder.getLoggerFactory()Lorg/slf4j/ILoggerFactory;" the class loader (instance of org/apache/nutch/plugin/PluginClassLoader) of the current class, org/slf4j/LoggerFactory, and the class loader (instance of sun/misc/Launcher$AppClassLoader) for the method's defining class, org/slf4j/impl/StaticLoggerBinder, have different Class objects for the type org/slf4j/ILoggerFactory used in the signature
    at org.slf4j.LoggerFactory.getILoggerFactory(LoggerFactory.java:418)
    at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:357)
    at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:383)
    at org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter.<clinit>(ElasticRestIndexWriter.java:71)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at java.lang.Class.newInstance(Class.java:442)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:161)
    at org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:57)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
{code}

[e66d44d|https://github.com/apache/nutch/commit/e66d44d9c290c550e78edb425a43e010b861172c#diff-aefa48b9ce916d2e33dc27b153c44977]
 removed the runtime dependency on {{slf4j-api-1.7.21.jar}} everywhere but in 
{{indexer-elastic-rest}}.
Possible fix: https://github.com/apache/nutch/pull/253





[jira] [Assigned] (NUTCH-2473) Elasticsearch REST Indexer broken due to wrong dependency

2017-12-07 Thread Moreno Feltscher (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moreno Feltscher reassigned NUTCH-2473:
---

Assignee: Sebastian Nagel

> Elasticsearch REST Indexer broken due to wrong dependency
> 
>
> Key: NUTCH-2473
> URL: https://issues.apache.org/jira/browse/NUTCH-2473
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Moreno Feltscher
>Assignee: Sebastian Nagel
>





[jira] [Created] (NUTCH-2493) Add configuration parameter for sitemap processing to crawler script

2018-01-08 Thread Moreno Feltscher (JIRA)
Moreno Feltscher created NUTCH-2493:
---

 Summary: Add configuration parameter for sitemap processing to 
crawler script
 Key: NUTCH-2493
 URL: https://issues.apache.org/jira/browse/NUTCH-2493
 Project: Nutch
  Issue Type: Improvement
Reporter: Moreno Feltscher
Assignee: Moreno Feltscher








[jira] [Created] (NUTCH-2491) Integrate sitemap processing and HostDB into crawl script

2018-01-03 Thread Moreno Feltscher (JIRA)
Moreno Feltscher created NUTCH-2491:
---

 Summary: Integrate sitemap processing and HostDB into crawl script
 Key: NUTCH-2491
 URL: https://issues.apache.org/jira/browse/NUTCH-2491
 Project: Nutch
  Issue Type: Improvement
Reporter: Moreno Feltscher
Assignee: Moreno Feltscher
Priority: Minor


Add three new steps to the crawl bash script:
1. Generate HostDB from CrawlDB
2. Inject URLs from the sitemap URLs found for the hosts in HostDB
3. If given, inject the sitemap URLs specified in one or more configuration 
files





[jira] [Updated] (NUTCH-2493) Add configuration parameter for sitemap processing to crawler script

2018-01-08 Thread Moreno Feltscher (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moreno Feltscher updated NUTCH-2493:

Description: 
While using the crawler script with the sitemap processing feature introduced 
in NUTCH-2491, I encountered performance issues when working with large 
sitemaps.
Therefore one should be able to specify whether sitemap processing based on 
HostDB should take place and, if so, how frequently it should be done.
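
One way to read "how frequently" is as "every N-th crawl iteration". The sketch below illustrates only that reading. The function name, the 1-based iteration counter and the convention that a frequency of 0 disables the step are all assumptions made for illustration; the issue does not describe how the crawl script implements this.

```python
def should_process_sitemaps(iteration: int, frequency: int) -> bool:
    """Decide whether the HostDB-based sitemap step runs in this iteration.

    Assumed convention (not taken from the crawl script):
      frequency <= 0  -> sitemap processing is disabled
      frequency == 1  -> run in every iteration
      frequency == n  -> run in iterations 1, n+1, 2n+1, ... (1-based)
    """
    if frequency <= 0:
        return False
    return (iteration - 1) % frequency == 0


if __name__ == "__main__":
    # Show which of six iterations would include the sitemap step.
    for i in range(1, 7):
        print(i, should_process_sitemaps(i, 3))
```

With large sitemaps, a higher frequency value would skip the expensive step in most iterations while still refreshing sitemap URLs periodically.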

> Add configuration parameter for sitemap processing to crawler script
> 
>
> Key: NUTCH-2493
> URL: https://issues.apache.org/jira/browse/NUTCH-2493
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>





[jira] [Updated] (NUTCH-2499) Elastic REST Indexer: Duplicate values

2018-01-16 Thread Moreno Feltscher (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moreno Feltscher updated NUTCH-2499:

Description: Due to a change in 
https://github.com/smartive/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e
 the Elastic REST indexer does not work with HashSets for values anymore but 
instead saves duplicated values as arrays.
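
The difference between the two behaviours can be shown in isolation. This is a minimal sketch and not the indexer's code: both function names and the sample field are hypothetical, and Python sets and lists stand in for Java HashSets and arrays.

```python
from collections import defaultdict


def collect_as_lists(pairs):
    """Keep every value per field, so repeated values end up duplicated."""
    fields = defaultdict(list)
    for name, value in pairs:
        fields[name].append(value)
    return dict(fields)


def collect_as_sets(pairs):
    """Collect values per field in a set, so repeated values collapse."""
    fields = defaultdict(set)
    for name, value in pairs:
        fields[name].add(value)
    # Sort only to get a stable, comparable result for the example.
    return {name: sorted(values) for name, values in fields.items()}


if __name__ == "__main__":
    sample = [("anchor", "Home"), ("anchor", "Home"), ("anchor", "About")]
    print(collect_as_lists(sample))
    print(collect_as_sets(sample))
```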

> Elastic REST Indexer: Duplicate values
> --
>
> Key: NUTCH-2499
> URL: https://issues.apache.org/jira/browse/NUTCH-2499
> Project: Nutch
>  Issue Type: Bug
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2499) Elastic REST Indexer: Duplicate values

2018-01-16 Thread Moreno Feltscher (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moreno Feltscher updated NUTCH-2499:

Description: Due to a change in 
https://github.com/apache/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e#diff-408fd2f17bc9791dcbf531ffe6574a6a
 the Elastic REST indexer does not work with HashSets for values anymore but 
instead saves duplicated values as arrays.  (was: Due to a change in 
https://github.com/smartive/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e
 the Elastic REST indexer does not work with HashSets for values anymore but 
instead saves duplicated values as arrays.)

> Elastic REST Indexer: Duplicate values
> --
>
> Key: NUTCH-2499
> URL: https://issues.apache.org/jira/browse/NUTCH-2499
> Project: Nutch
>  Issue Type: Bug
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
>





[jira] [Commented] (NUTCH-2496) Speed up link inversion step in crawling script

2018-01-15 Thread Moreno Feltscher (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326640#comment-16326640
 ] 

Moreno Feltscher commented on NUTCH-2496:
-

[~markus17]: Thanks for that hint. This is something I still don't really get. 
Where and to what steps exactly are those filters/normalizers being applied?

In my case I only have a {{regex-urlfilter.txt}} file as well as the following 
plugin configuration:
{code:xml}
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|protocol-http|urlfilter-regex|index-(basic|anchor|metadata)|headings|language-identifier|query-(basic|site|url|lang)|indexer-elastic-rest|parse-(text|html|tika|metatags)|urlnormalizer-(pass|regex|basic)</value>
</property>
{code}

Would it make sense to disable filtering/normalization in LinkDB?

> Speed up link inversion step in crawling script
> ---
>
> Key: NUTCH-2496
> URL: https://issues.apache.org/jira/browse/NUTCH-2496
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>
> While working on a project where I have to index a huge number of URLs I 
> encountered an issue with the link inversion step of the crawling script. A 
> while ago Ian Lopata stumbled upon the same issue as described here: 
> http://lucene.472066.n3.nabble.com/InvertLinks-Performance-Nutch-1-6-td4183004.html
> {quote}
> I am running the invertlinks step in my Nutch 1.6 based crawl process on a 
> single node.  I run invertlinks only because I need the Inlinks in the 
> indexer step so as to store them with the document.  I do not need the 
> anchor text and I am not scoring.  I am finding that invertlinks (and more 
> specifically the merge of the linkdb) takes a long time - about 30 minutes 
> for a crawl of around 150K documents.  I am looking for ways that I might 
> shorten this processing time.  Any suggestions? 
> {quote}
> Back then [~wastl-nagel] suggested turning off the normalizers and filters 
> during the inversion step which speeds up the process a bunch.
> In my case however I kind of depend on those so this is no real solution.
> I opened this issue here in order to get some feedback on how we could 
> improve things in a crawl script and speed up the process.





[jira] [Created] (NUTCH-2502) Any23 Plugin: Add Content-Type filtering

2018-01-23 Thread Moreno Feltscher (JIRA)
Moreno Feltscher created NUTCH-2502:
---

 Summary: Any23 Plugin: Add Content-Type filtering
 Key: NUTCH-2502
 URL: https://issues.apache.org/jira/browse/NUTCH-2502
 Project: Nutch
  Issue Type: Improvement
Reporter: Moreno Feltscher
Assignee: Moreno Feltscher


It should be possible to filter based on a document's Content-Type when using 
Any23 extractors.





[jira] [Updated] (NUTCH-2499) Elastic REST Indexer: Duplicate values

2018-01-16 Thread Moreno Feltscher (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moreno Feltscher updated NUTCH-2499:

Environment: (was: Due to a change in 
https://github.com/smartive/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e
 the Elastic REST indexer does not work with HashSets for values anymore but 
instead saves duplicated values as arrays.)

> Elastic REST Indexer: Duplicate values
> --
>
> Key: NUTCH-2499
> URL: https://issues.apache.org/jira/browse/NUTCH-2499
> Project: Nutch
>  Issue Type: Bug
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
>






[jira] [Created] (NUTCH-2499) Elastic REST Indexer: Duplicate values

2018-01-16 Thread Moreno Feltscher (JIRA)
Moreno Feltscher created NUTCH-2499:
---

 Summary: Elastic REST Indexer: Duplicate values
 Key: NUTCH-2499
 URL: https://issues.apache.org/jira/browse/NUTCH-2499
 Project: Nutch
  Issue Type: Bug
 Environment: Due to a change in 
https://github.com/smartive/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e
 the Elastic REST indexer does not work with HashSets for values anymore but 
instead saves duplicated values as arrays.
Reporter: Moreno Feltscher
Assignee: Moreno Feltscher








[jira] [Commented] (NUTCH-2496) Speed up link inversion step in crawling script

2018-01-17 Thread Moreno Feltscher (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329760#comment-16329760
 ] 

Moreno Feltscher commented on NUTCH-2496:
-

Thanks again for clearing things up even more.

One last question about the "changing normalizers and/or filters" though: What 
happens if I change let's say my filters and after that I do a full re-crawl 
(inject - generate - fetch - parse - update - link inversion - index - index 
cleanup) without having filtering turned on in my link inversion step? Would 
Nutch take into account the new filters and eventually drop documents that do 
not match the filters anymore from my index?

> Speed up link inversion step in crawling script
> ---
>
> Key: NUTCH-2496
> URL: https://issues.apache.org/jira/browse/NUTCH-2496
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>





[jira] [Created] (NUTCH-2497) Elastic REST Indexer: Allow multiple hosts

2018-01-12 Thread Moreno Feltscher (JIRA)
Moreno Feltscher created NUTCH-2497:
---

 Summary: Elastic REST Indexer: Allow multiple hosts
 Key: NUTCH-2497
 URL: https://issues.apache.org/jira/browse/NUTCH-2497
 Project: Nutch
  Issue Type: Improvement
Reporter: Moreno Feltscher
Assignee: Moreno Feltscher


Allow specifying a list of Elasticsearch hosts to index documents to. This 
would be especially helpful when working with an Elasticsearch cluster that 
consists of multiple nodes.
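
As a rough illustration of what "a list of hosts" could look like on the configuration side, the sketch below turns a comma-separated value into base URLs. The value format, the function name and the `http` scheme are assumptions; the issue does not specify them. The fallback port 9200 is Elasticsearch's default HTTP port.

```python
from typing import List


def parse_hosts(value: str, default_port: int = 9200, scheme: str = "http") -> List[str]:
    """Turn 'host1,host2:9201' into a list of base URLs.

    Hypothetical helper: entries are separated by commas, an optional
    ':port' suffix overrides the default port, and blank entries are
    ignored. IPv6 literals are not handled.
    """
    urls = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.partition(":")
        urls.append(f"{scheme}://{host}:{port if sep else default_port}")
    return urls


if __name__ == "__main__":
    print(parse_hosts("es1, es2:9201"))
```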





[jira] [Commented] (NUTCH-2496) Speed up link inversion step in crawling script

2018-01-12 Thread Moreno Feltscher (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16324737#comment-16324737
 ] 

Moreno Feltscher commented on NUTCH-2496:
-

One thing I found out is that if I do the link inversion step after all the 
iterations are done it takes a lot less time. Would it be feasible to move the 
link inversion and indexing step out of the loop and do it only once in the 
end? Any thoughts about this?

> Speed up link inversion step in crawling script
> ---
>
> Key: NUTCH-2496
> URL: https://issues.apache.org/jira/browse/NUTCH-2496
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>





[jira] [Assigned] (NUTCH-2496) Speed up link inversion step in crawling script

2018-01-12 Thread Moreno Feltscher (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moreno Feltscher reassigned NUTCH-2496:
---

Assignee: Lewis John McGibbney

> Speed up link inversion step in crawling script
> ---
>
> Key: NUTCH-2496
> URL: https://issues.apache.org/jira/browse/NUTCH-2496
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>





[jira] [Created] (NUTCH-2496) Speed up link inversion step in crawling script

2018-01-12 Thread Moreno Feltscher (JIRA)
Moreno Feltscher created NUTCH-2496:
---

 Summary: Speed up link inversion step in crawling script
 Key: NUTCH-2496
 URL: https://issues.apache.org/jira/browse/NUTCH-2496
 Project: Nutch
  Issue Type: Improvement
Reporter: Moreno Feltscher


While working on a project where I have to index a huge number of URLs I 
encountered an issue with the link inversion step of the crawling script. A 
while ago Ian Lopata stumbled upon the same issue as described here: 
http://lucene.472066.n3.nabble.com/InvertLinks-Performance-Nutch-1-6-td4183004.html
{quote}
I am running the invertlinks step in my Nutch 1.6 based crawl process on a 
single node.  I run invertlinks only because I need the Inlinks in the 
indexer step so as to store them with the document.  I do not need the 
anchor text and I am not scoring.  I am finding that invertlinks (and more 
specifically the merge of the linkdb) takes a long time - about 30 minutes 
for a crawl of around 150K documents.  I am looking for ways that I might 
shorten this processing time.  Any suggestions? 
{quote}

Back then [~wastl-nagel] suggested turning off the normalizers and filters 
during the inversion step which speeds up the process a bunch.
In my case however I kind of depend on those so this is no real solution.

I opened this issue here in order to get some feedback on how we could improve 
things in a crawl script and speed up the process.





[jira] [Created] (NUTCH-2495) Use -deleteGone instead of clean job in crawler script while indexing

2018-01-12 Thread Moreno Feltscher (JIRA)
Moreno Feltscher created NUTCH-2495:
---

 Summary: Use -deleteGone instead of clean job in crawler script 
while indexing
 Key: NUTCH-2495
 URL: https://issues.apache.org/jira/browse/NUTCH-2495
 Project: Nutch
  Issue Type: Improvement
Reporter: Moreno Feltscher
Assignee: Moreno Feltscher


Instead of running {{bin/nutch clean}} after indexing the documents, run 
{{bin/nutch index}} with the {{-deleteGone}} flag, which deletes not only gone 
and duplicated documents but also redirects from the index.





[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2018-01-11 Thread Moreno Feltscher (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16323026#comment-16323026
 ] 

Moreno Feltscher commented on NUTCH-1129:
-

[~lewismc]: Thanks for merging! A special thank you goes out to my amazing 
co-workers who did a great job on this :-) cc [~thilohaas]

> Any23 Nutch plugin
> --
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although as of writing 
> Any23 is not part of the ASF, the project is working towards integration into 
> the Apache Incubator. Once the project proves its value, this would be an 
> excellent addition to the Nutch 1.X codebase. 





[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Moreno Feltscher (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347742#comment-16347742
 ] 

Moreno Feltscher commented on NUTCH-2466:
-

I absolutely get your point and I'm 100% with you on this - forever is not a 
good idea in any scenario :-) Just wanted to make sure I understand this change 
correctly.
FYI, Google Chrome treats 21 redirects as "too many" - I'm going to use 20 for 
{{sitemap.redir.max}} in my setup => 
https://stackoverflow.com/a/36041063/5884584
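
For reference, a capped redirect loop can be sketched as follows. This is not the sitemap processor's code: the function name and the `fetch` callback are hypothetical, and only the idea of an upper bound such as {{sitemap.redir.max}} comes from this thread.

```python
REDIRECT_CODES = (301, 302, 303, 307, 308)


def resolve_with_limit(url, fetch, max_redirects):
    """Follow redirects starting at `url`, giving up after `max_redirects`.

    `fetch(url)` is a caller-supplied function returning a tuple
    (status_code, location_or_None). Returns the final URL, or None
    when more than `max_redirects` redirects would be needed.
    """
    # One initial request plus at most `max_redirects` follow-ups.
    for _ in range(max_redirects + 1):
        status, location = fetch(url)
        if status not in REDIRECT_CODES or location is None:
            return url
        url = location
    return None


if __name__ == "__main__":
    chain = {"a": (301, "b"), "b": (302, "c"), "c": (200, None)}
    print(resolve_with_limit("a", lambda u: chain[u], 20))
```

A redirect loop never reaches a non-redirect response, so the cap is what guarantees termination.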

> Sitemap processor to follow redirects
> -
>
> Key: NUTCH-2466
> URL: https://issues.apache.org/jira/browse/NUTCH-2466
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2466.patch, NUTCH-2466.patch, NUTCH-2466.patch
>
>
> It does follow http > https, but not the following redirect, e.g. 
> sitemap_index.xml that some websites have.





[jira] [Created] (NUTCH-2508) Misleading documentation about http.proxy.exception.list

2018-01-31 Thread Moreno Feltscher (JIRA)
Moreno Feltscher created NUTCH-2508:
---

 Summary: Misleading documentation about http.proxy.exception.list
 Key: NUTCH-2508
 URL: https://issues.apache.org/jira/browse/NUTCH-2508
 Project: Nutch
  Issue Type: Bug
Reporter: Moreno Feltscher
Assignee: Moreno Feltscher


The description about {{http.proxy.exception.list}} states that domains as well 
as URLs can be configured to be excluded from being routed through a 
pre-configured proxy. This is misleading since only hosts are being checked 
when using this feature.
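
A small sketch of why a URL entry in such a list has no effect when only the host is compared. The function and the example hosts are hypothetical and this is not the Nutch implementation; it only mirrors the behaviour described above.

```python
from urllib.parse import urlsplit


def use_proxy(url: str, exceptions: set) -> bool:
    """Return False when the URL's host is listed as a proxy exception.

    Only the host part of `url` is compared against `exceptions`,
    so entries that are full URLs can never match.
    """
    host = urlsplit(url).hostname  # lower-cased by urlsplit, None if absent
    return host is None or host not in exceptions


if __name__ == "__main__":
    page = "http://intranet.example.org/page"
    print(use_proxy(page, {"intranet.example.org"}))  # host entry
    print(use_proxy(page, {page}))                    # URL entry
```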





[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-31 Thread Moreno Feltscher (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347718#comment-16347718
 ] 

Moreno Feltscher commented on NUTCH-2466:
-

Is there any way to configure this so that Nutch follows redirects forever 
(which was the case before this patch)?

> Sitemap processor to follow redirects
> -
>
> Key: NUTCH-2466
> URL: https://issues.apache.org/jira/browse/NUTCH-2466
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2466.patch, NUTCH-2466.patch, NUTCH-2466.patch
>
>





[jira] [Updated] (NUTCH-2490) Sitemap processing: Sitemap index files not working

2018-01-02 Thread Moreno Feltscher (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moreno Feltscher updated NUTCH-2490:

Description: The [sitemap processing 
feature|https://wiki.apache.org/nutch/SitemapFeature] does not properly handle 
sitemap index files due to an unnecessary conditional.  (was: The [sitemap 
processing feature](https://wiki.apache.org/nutch/SitemapFeature) does not 
properly handle sitemap index files due to a unnecessary conditional.)

> Sitemap processing: Sitemap index files not working
> ---
>
> Key: NUTCH-2490
> URL: https://issues.apache.org/jira/browse/NUTCH-2490
> Project: Nutch
>  Issue Type: Bug
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>





[jira] [Created] (NUTCH-2492) Add more configuration parameters to crawl script

2018-01-03 Thread Moreno Feltscher (JIRA)
Moreno Feltscher created NUTCH-2492:
---

 Summary: Add more configuration parameters to crawl script 
 Key: NUTCH-2492
 URL: https://issues.apache.org/jira/browse/NUTCH-2492
 Project: Nutch
  Issue Type: New Feature
Reporter: Moreno Feltscher
Assignee: Moreno Feltscher


Instead of having to copy and adjust the crawl script in order to specify the 
following configuration options, allow the user to pass them in as arguments:
- numSlaves
- numTasks
- sizeFetchlist
- timeLimitFetch
- numThreads
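
The crawl script itself is a bash script; the following sketch only illustrates the idea of exposing these five options as arguments. The flag names and the default values are placeholders chosen for the example, not the ones the script uses.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Expose the five tuning options as optional arguments.

    Flag names and defaults are hypothetical placeholders.
    """
    parser = argparse.ArgumentParser(prog="crawl")
    parser.add_argument("--num-slaves", type=int, default=1)
    parser.add_argument("--num-tasks", type=int, default=2)
    parser.add_argument("--size-fetchlist", type=int, default=50000)
    parser.add_argument("--time-limit-fetch", type=int, default=180)
    parser.add_argument("--num-threads", type=int, default=50)
    return parser


if __name__ == "__main__":
    # Override one option and keep the placeholder defaults for the rest.
    args = build_parser().parse_args(["--num-threads", "10"])
    print(args)
```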





[jira] [Created] (NUTCH-2490) Sitemap processing: Sitemap index files not working

2018-01-02 Thread Moreno Feltscher (JIRA)
Moreno Feltscher created NUTCH-2490:
---

 Summary: Sitemap processing: Sitemap index files not working
 Key: NUTCH-2490
 URL: https://issues.apache.org/jira/browse/NUTCH-2490
 Project: Nutch
  Issue Type: Bug
Reporter: Moreno Feltscher
Assignee: Moreno Feltscher


The [sitemap processing feature](https://wiki.apache.org/nutch/SitemapFeature) 
does not properly handle sitemap index files due to an unnecessary conditional.





[jira] [Created] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-01-22 Thread Moreno Feltscher (JIRA)
Moreno Feltscher created NUTCH-2501:
---

 Summary: Take into account $NUTCH_HEAPSIZE when crawling using 
crawl script
 Key: NUTCH-2501
 URL: https://issues.apache.org/jira/browse/NUTCH-2501
 Project: Nutch
  Issue Type: Improvement
Reporter: Moreno Feltscher
Assignee: Moreno Feltscher








[jira] [Assigned] (NUTCH-2502) Any23 Plugin: Add Content-Type filtering

2018-01-23 Thread Moreno Feltscher (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moreno Feltscher reassigned NUTCH-2502:
---

Assignee: Lewis John McGibbney  (was: Moreno Feltscher)

> Any23 Plugin: Add Content-Type filtering
> 
>
> Key: NUTCH-2502
> URL: https://issues.apache.org/jira/browse/NUTCH-2502
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>
> It should be possible to filter based on a document's Content-Type when using 
> Any23 extractors.
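
One way such a filter might look, sketched in Python rather than the plugin's Java. The function name {{should_extract}} and the allow-list parameter are hypothetical and are not taken from the Any23 plugin or from the pull request.

```python
def should_extract(content_type, allowed_types):
    """Return True if the document's media type is in the allow-list.

    content_type: raw Content-Type header value, e.g. "text/html; charset=UTF-8"
    allowed_types: set of lowercase media types (hypothetical configuration)
    """
    if not content_type:
        # No Content-Type available: skip extraction in this sketch.
        return False
    # Drop parameters such as "; charset=UTF-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in allowed_types


if __name__ == "__main__":
    allowed = {"text/html", "application/xhtml+xml"}
    print(should_extract("text/html; charset=UTF-8", allowed))
    print(should_extract("application/pdf", allowed))
```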





[jira] [Assigned] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-01-23 Thread Moreno Feltscher (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moreno Feltscher reassigned NUTCH-2501:
---

Assignee: Lewis John McGibbney  (was: Moreno Feltscher)

> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>






[jira] [Assigned] (NUTCH-2495) Use -deleteGone instead of clean job in crawler script while indexing

2018-01-23 Thread Moreno Feltscher (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moreno Feltscher reassigned NUTCH-2495:
---

Assignee: Lewis John McGibbney  (was: Moreno Feltscher)

> Use -deleteGone instead of clean job in crawler script while indexing
> -
>
> Key: NUTCH-2495
> URL: https://issues.apache.org/jira/browse/NUTCH-2495
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>
> Instead of running {{bin/nutch clean}} after indexing the documents, run 
> {{bin/nutch index}} with the {{-deleteGone}} flag, which deletes not only 
> gone and duplicate documents but also redirects from the index.





[jira] [Assigned] (NUTCH-2499) Elastic REST Indexer: Duplicate values

2018-01-23 Thread Moreno Feltscher (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moreno Feltscher reassigned NUTCH-2499:
---

Assignee: Lewis John McGibbney  (was: Moreno Feltscher)

> Elastic REST Indexer: Duplicate values
> --
>
> Key: NUTCH-2499
> URL: https://issues.apache.org/jira/browse/NUTCH-2499
> Project: Nutch
>  Issue Type: Bug
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>
> Due to a change in 
> https://github.com/apache/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e#diff-408fd2f17bc9791dcbf531ffe6574a6a
> , the Elastic REST indexer no longer works with HashSets for values and 
> instead saves duplicate values as arrays.
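
The difference between the two behaviours can be illustrated with a small sketch. It is written in Python rather than the plugin's Java, the function names and sample values are made up, and it does not reproduce the indexer's actual code.

```python
def collect_values_as_list(values):
    # List semantics: every value is kept, so repeats end up in the document.
    return list(values)


def collect_values_as_set(values):
    # Set semantics: each distinct value is kept once. dict.fromkeys keeps
    # first-seen order here for readability; a Java HashSet gives no such
    # ordering guarantee.
    return list(dict.fromkeys(values))


if __name__ == "__main__":
    field_values = ["en", "en", "de"]  # made-up values for one field
    print(collect_values_as_list(field_values))
    print(collect_values_as_set(field_values))
```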





[jira] [Created] (NUTCH-2503) Add option to run tests for a single plugin

2018-01-23 Thread Moreno Feltscher (JIRA)
Moreno Feltscher created NUTCH-2503:
---

 Summary: Add option to run tests for a single plugin
 Key: NUTCH-2503
 URL: https://issues.apache.org/jira/browse/NUTCH-2503
 Project: Nutch
  Issue Type: Improvement
Reporter: Moreno Feltscher
Assignee: Moreno Feltscher


Sometimes it makes sense to just run tests for a single plugin instead of 
building all plugins and running all tests at once.





[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-01-23 Thread Moreno Feltscher (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335999#comment-16335999
 ] 

Moreno Feltscher commented on NUTCH-2501:
-

Pull request: https://github.com/apache/nutch/pull/279

> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
>






[jira] [Commented] (NUTCH-2503) Add option to run tests for a single plugin

2018-01-23 Thread Moreno Feltscher (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335991#comment-16335991
 ] 

Moreno Feltscher commented on NUTCH-2503:
-

Pull request: https://github.com/apache/nutch/pull/281

> Add option to run tests for a single plugin
> ---
>
> Key: NUTCH-2503
> URL: https://issues.apache.org/jira/browse/NUTCH-2503
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
>
> Sometimes it makes sense to just run tests for a single plugin instead of 
> building all plugins and running all tests at once.





[jira] [Commented] (NUTCH-2502) Any23 Plugin: Add Content-Type filtering

2018-01-23 Thread Moreno Feltscher (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335994#comment-16335994
 ] 

Moreno Feltscher commented on NUTCH-2502:
-

Pull request: https://github.com/apache/nutch/pull/280

> Any23 Plugin: Add Content-Type filtering
> 
>
> Key: NUTCH-2502
> URL: https://issues.apache.org/jira/browse/NUTCH-2502
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
>
> It should be possible to filter based on a document's Content-Type when using 
> Any23 extractors.





[jira] [Commented] (NUTCH-2755) Remove obsolete plugin indexer-elastic-rest

2020-06-09 Thread Moreno Feltscher (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129892#comment-17129892
 ] 

Moreno Feltscher commented on NUTCH-2755:
-

[~snagel]: Is there an example of how to use the document routing feature to 
store documents in different indices based on their language?

> Remove obsolete plugin indexer-elastic-rest
> ---
>
> Key: NUTCH-2755
> URL: https://issues.apache.org/jira/browse/NUTCH-2755
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Affects Versions: 1.17
>Reporter: Sebastian Nagel
>Assignee: Shashanka Balakuntala Srinivasa
>Priority: Major
> Fix For: 1.17
>
>
> With NUTCH-2739 the plugin indexer-elastic uses the [REST 
> client|https://www.elastic.co/guide/en/elasticsearch/client/java-rest/7.3/java-rest-high.html]
>  instead of the deprecated 
> [TransportClient|https://www.elastic.co/guide/en/elasticsearch/client/java-api/7.3/transport-client.html].
>  This obsoletes the separate REST-based plugin indexer-elastic-rest.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)