[jira] [Created] (NUTCH-2403) Nutch Selenium: Wrong documentation about PhantomJS
Moreno Feltscher created NUTCH-2403: --- Summary: Nutch Selenium: Wrong documentation about PhantomJS Key: NUTCH-2403 URL: https://issues.apache.org/jira/browse/NUTCH-2403 Project: Nutch Issue Type: Bug Reporter: Moreno Feltscher The Nutch Selenium documentation states that PhantomJS is selected by setting {{selenium.driver}} to {{phantomJS}}. The correct value is {{phantomjs}} (all lowercase), according to https://github.com/apache/nutch/blob/master/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L124 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (NUTCH-2486) Compiler Warning: Unchecked / unsafe operations in MimeTypeIndexingFilter
Moreno Feltscher created NUTCH-2486: --- Summary: Compiler Warning: Unchecked / unsafe operations in MimeTypeIndexingFilter Key: NUTCH-2486 URL: https://issues.apache.org/jira/browse/NUTCH-2486 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.14 Reporter: Moreno Feltscher When compiling the Nutch source, the following warning is shown: {quote} Note: src/plugin/mimetype-filter/src/java/org/apache/nutch/indexer/filter/MimeTypeIndexingFilter.java uses unchecked or unsafe operations. {quote}
[jira] [Created] (NUTCH-2473) Elasticsearch REST Indexer broken due to wrong dependency
Moreno Feltscher created NUTCH-2473: --- Summary: Elasticsearch REST Indexer broken due to wrong dependency Key: NUTCH-2473 URL: https://issues.apache.org/jira/browse/NUTCH-2473 Project: Nutch Issue Type: Bug Affects Versions: 1.14 Reporter: Moreno Feltscher When trying to index into Elasticsearch using {{indexer-elastic-rest}}, the following error is thrown: {code} Exception in thread "main" java.lang.LinkageError: loader constraint violation: when resolving method "org.slf4j.impl.StaticLoggerBinder.getLoggerFactory()Lorg/slf4j/ILoggerFactory;" the class loader (instance of org/apache/nutch/plugin/PluginClassLoader) of the current class, org/slf4j/LoggerFactory, and the class loader (instance of sun/misc/Launcher$AppClassLoader) for the method's defining class, org/slf4j/impl/StaticLoggerBinder, have different Class objects for the type org/slf4j/ILoggerFactory used in the signature at org.slf4j.LoggerFactory.getILoggerFactory(LoggerFactory.java:418) at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:357) at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:383) at org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter.(ElasticRestIndexWriter.java:71) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at java.lang.Class.newInstance(Class.java:442) at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:161) at org.apache.nutch.indexer.IndexWriters.(IndexWriters.java:57) at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123) at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239) {code} 
[e66d44d|https://github.com/apache/nutch/commit/e66d44d9c290c550e78edb425a43e010b861172c#diff-aefa48b9ce916d2e33dc27b153c44977] removed the runtime dependency on {{slf4j-api-1.7.21.jar}} everywhere but in {{indexer-elastic-rest}}. Possible fix: https://github.com/apache/nutch/pull/253 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (NUTCH-2473) Elasticsearch REST Indexer broken due to wrong dependency
[ https://issues.apache.org/jira/browse/NUTCH-2473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Moreno Feltscher reassigned NUTCH-2473: --- Assignee: Sebastian Nagel > Elasticsearch REST Indexer broken due to wrong dependency > > > Key: NUTCH-2473 > URL: https://issues.apache.org/jira/browse/NUTCH-2473 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Moreno Feltscher >Assignee: Sebastian Nagel
[jira] [Created] (NUTCH-2493) Add configuration parameter for sitemap processing to crawler script
Moreno Feltscher created NUTCH-2493: --- Summary: Add configuration parameter for sitemap processing to crawler script Key: NUTCH-2493 URL: https://issues.apache.org/jira/browse/NUTCH-2493 Project: Nutch Issue Type: Improvement Reporter: Moreno Feltscher Assignee: Moreno Feltscher -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (NUTCH-2491) Integrate sitemap processing and HostDB into crawl script
Moreno Feltscher created NUTCH-2491: --- Summary: Integrate sitemap processing and HostDB into crawl script Key: NUTCH-2491 URL: https://issues.apache.org/jira/browse/NUTCH-2491 Project: Nutch Issue Type: Improvement Reporter: Moreno Feltscher Assignee: Moreno Feltscher Priority: Minor Add three new steps to the crawl bash script: 1. Generate the HostDB from the CrawlDB. 2. Inject URLs from the sitemaps found for the hosts in the HostDB. 3. If given, inject the sitemap URLs specified in one or more configuration files.
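The three steps could be sketched roughly as follows. This is a dry run that only prints commands and never calls Nutch. The function name, the paths and the `sitemaps` directory are hypothetical, and the `updatehostdb` and `sitemap` arguments are assumptions that should be checked against the usage output of the Nutch version in use.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the commands instead of executing them.
# sitemap_steps, the paths and the argument layout are illustrative assumptions.

sitemap_steps() {
  local crawl_path="$1"    # e.g. crawl
  local sitemap_dir="$2"   # optional: directory with files listing sitemap URLs

  # 1. Generate (or update) the HostDB from the CrawlDB.
  echo "bin/nutch updatehostdb -crawldb $crawl_path/crawldb -hostdb $crawl_path/hostdb"

  # 2. Inject URLs from the sitemaps found for the hosts in the HostDB.
  echo "bin/nutch sitemap $crawl_path/crawldb -hostdb $crawl_path/hostdb"

  # 3. Only if given: inject the explicitly configured sitemap URLs.
  if [ -n "$sitemap_dir" ]; then
    echo "bin/nutch sitemap $crawl_path/crawldb -sitemapUrls $sitemap_dir"
  fi
}

sitemap_steps crawl sitemaps
```

Passing an empty second argument skips step 3, which matches the "if given" wording of the issue.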
[jira] [Updated] (NUTCH-2493) Add configuration parameter for sitemap processing to crawler script
[ https://issues.apache.org/jira/browse/NUTCH-2493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Moreno Feltscher updated NUTCH-2493: Description: While using the crawler script with the sitemap processing feature introduced in NUTCH-2491 I encountered some performance issues when working with large sitemaps. Therefore one should be able to specify if sitemap processing based on HostDB should take place and if so how frequently it should be done. > Add configuration parameter for sitemap processing to crawler script > > > Key: NUTCH-2493 > URL: https://issues.apache.org/jira/browse/NUTCH-2493 > Project: Nutch > Issue Type: Improvement >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher > > While using the crawler script with the sitemap processing feature introduced > in NUTCH-2491 I encountered some performance issues when working with large > sitemaps. > Therefore one should be able to specify if sitemap processing based on HostDB > should take place and if so how frequently it should be done. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
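One way such a parameter could behave is sketched below. The function name and the meaning of the value (0 disables HostDB-based sitemap processing, N runs it in the first iteration and then in every N-th one) are assumptions for illustration, not the option the crawl script actually provides.

```shell
#!/usr/bin/env bash
# Hypothetical frequency setting:
#   0     -> never process sitemaps
#   N > 0 -> process in iteration 1, 1+N, 1+2N, ...
should_process_sitemaps() {
  local iteration="$1" frequency="$2"
  # The first test short-circuits, so a frequency of 0 never reaches the modulo.
  [ "$frequency" -gt 0 ] && [ $(( (iteration - 1) % frequency )) -eq 0 ]
}

for i in 1 2 3 4 5 6; do
  if should_process_sitemaps "$i" 3; then
    echo "iteration $i: process sitemaps"
  else
    echo "iteration $i: skip sitemaps"
  fi
done
```

With a frequency of 3, only iterations 1 and 4 of the six shown would pay the cost of processing large sitemaps.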
[jira] [Updated] (NUTCH-2499) Elastic REST Indexer: Duplicate values
[ https://issues.apache.org/jira/browse/NUTCH-2499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Moreno Feltscher updated NUTCH-2499: Description: Due to a change in https://github.com/smartive/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e the Elastic REST indexer does not work with HashSets for values anymore but instead saves duplicated values as arrays. > Elastic REST Indexer: Duplicate values > -- > > Key: NUTCH-2499 > URL: https://issues.apache.org/jira/browse/NUTCH-2499 > Project: Nutch > Issue Type: Bug >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher >Priority: Major > > Due to a change in > https://github.com/smartive/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e > the Elastic REST indexer does not work with HashSets for values anymore but > instead saves duplicated values as arrays. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2499) Elastic REST Indexer: Duplicate values
[ https://issues.apache.org/jira/browse/NUTCH-2499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Moreno Feltscher updated NUTCH-2499: Description: Due to a change in https://github.com/apache/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e#diff-408fd2f17bc9791dcbf531ffe6574a6a the Elastic REST indexer does not work with HashSets for values anymore but instead saves duplicated values as arrays. (was: Due to a change in https://github.com/smartive/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e the Elastic REST indexer does not work with HashSets for values anymore but instead saves duplicated values as arrays.) > Elastic REST Indexer: Duplicate values > -- > > Key: NUTCH-2499 > URL: https://issues.apache.org/jira/browse/NUTCH-2499 > Project: Nutch > Issue Type: Bug >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher >Priority: Major > > Due to a change in > https://github.com/apache/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e#diff-408fd2f17bc9791dcbf531ffe6574a6a > the Elastic REST indexer does not work with HashSets for values anymore but > instead saves duplicated values as arrays. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2496) Speed up link inversion step in crawling script
[ https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326640#comment-16326640 ] Moreno Feltscher commented on NUTCH-2496: - [~markus17]: Thanks for that hint. This is something I still don't really get. Where exactly, and in which steps, are those filters and normalizers applied? In my case I only have a {{regex-urlfilter.txt}} file as well as the following plugin configuration: {code:xml} <property> <name>plugin.includes</name> <value>protocol-httpclient|protocol-http|urlfilter-regex|index-(basic|anchor|metadata)|headings|language-identifier|query-(basic|site|url|lang)|indexer-elastic-rest|parse-(text|html|tika|metatags)|urlnormalizer-(pass|regex|basic)</value> </property> {code} Would it make sense to disable filtering/normalization in LinkDB? > Speed up link inversion step in crawling script > --- > > Key: NUTCH-2496 > URL: https://issues.apache.org/jira/browse/NUTCH-2496 > Project: Nutch > Issue Type: Improvement >Reporter: Moreno Feltscher >Assignee: Lewis John McGibbney >Priority: Major
[jira] [Created] (NUTCH-2502) Any23 Plugin: Add Content-Type filtering
Moreno Feltscher created NUTCH-2502: --- Summary: Any23 Plugin: Add Content-Type filtering Key: NUTCH-2502 URL: https://issues.apache.org/jira/browse/NUTCH-2502 Project: Nutch Issue Type: Improvement Reporter: Moreno Feltscher Assignee: Moreno Feltscher It should be possible to filter based on a document's Content-Type when using Any23 extractors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2499) Elastic REST Indexer: Duplicate values
[ https://issues.apache.org/jira/browse/NUTCH-2499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Moreno Feltscher updated NUTCH-2499: Environment: (was: Due to a change in https://github.com/smartive/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e the Elastic REST indexer does not work with HashSets for values anymore but instead saves duplicated values as arrays.) > Elastic REST Indexer: Duplicate values > -- > > Key: NUTCH-2499 > URL: https://issues.apache.org/jira/browse/NUTCH-2499 > Project: Nutch > Issue Type: Bug >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (NUTCH-2499) Elastic REST Indexer: Duplicate values
Moreno Feltscher created NUTCH-2499: --- Summary: Elastic REST Indexer: Duplicate values Key: NUTCH-2499 URL: https://issues.apache.org/jira/browse/NUTCH-2499 Project: Nutch Issue Type: Bug Environment: Due to a change in https://github.com/smartive/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e the Elastic REST indexer does not work with HashSets for values anymore but instead saves duplicated values as arrays. Reporter: Moreno Feltscher Assignee: Moreno Feltscher -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2496) Speed up link inversion step in crawling script
[ https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329760#comment-16329760 ] Moreno Feltscher commented on NUTCH-2496: - Thanks again for clearing things up even more. One last question about "changing normalizers and/or filters", though: what happens if I change, say, my filters and then do a full re-crawl (inject - generate - fetch - parse - update - link inversion - index - index cleanup) without filtering turned on in the link inversion step? Would Nutch take the new filters into account and eventually drop documents that no longer match them from my index? > Speed up link inversion step in crawling script > --- > > Key: NUTCH-2496 > URL: https://issues.apache.org/jira/browse/NUTCH-2496 > Project: Nutch > Issue Type: Improvement >Reporter: Moreno Feltscher >Assignee: Lewis John McGibbney >Priority: Major
[jira] [Created] (NUTCH-2497) Elastic REST Indexer: Allow multiple hosts
Moreno Feltscher created NUTCH-2497: --- Summary: Elastic REST Indexer: Allow multiple hosts Key: NUTCH-2497 URL: https://issues.apache.org/jira/browse/NUTCH-2497 Project: Nutch Issue Type: Improvement Reporter: Moreno Feltscher Assignee: Moreno Feltscher Allow specifying a list of Elasticsearch hosts to index documents to. This would be especially helpful when working with an Elasticsearch cluster that consists of multiple nodes.
[jira] [Commented] (NUTCH-2496) Speed up link inversion step in crawling script
[ https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16324737#comment-16324737 ] Moreno Feltscher commented on NUTCH-2496: - One thing I found out is that if I do the link inversion step after all the iterations are done it takes a lot less time. Would it be feasible to move the link inversion and indexing step out of the loop and do it only once in the end? Any thoughts about this? > Speed up link inversion step in crawling script > --- > > Key: NUTCH-2496 > URL: https://issues.apache.org/jira/browse/NUTCH-2496 > Project: Nutch > Issue Type: Improvement >Reporter: Moreno Feltscher >Assignee: Lewis John McGibbney
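The restructuring asked about in this comment might look like the following dry run, which only prints what would be executed. The function name and paths are hypothetical, and the per-round steps are collapsed into a single placeholder line. It illustrates the question rather than an agreed-upon change.

```shell
#!/usr/bin/env bash
# Dry-run sketch: link inversion and indexing moved out of the crawl loop.
# crawl_dry_run and the paths are illustrative assumptions.
crawl_dry_run() {
  local crawl_path="$1" rounds="$2" i=1

  while [ "$i" -le "$rounds" ]; do
    # These steps still run once per round.
    echo "# round $i: generate, fetch, parse, updatedb"
    i=$((i + 1))
  done

  # Run once at the end, over all segments instead of one segment per round.
  echo "bin/nutch invertlinks $crawl_path/linkdb -dir $crawl_path/segments"
  echo "bin/nutch index $crawl_path/crawldb -linkdb $crawl_path/linkdb -dir $crawl_path/segments"
}

crawl_dry_run crawl 3
```

The trade-off is that the index would only be updated after the last round instead of after every round.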
[jira] [Assigned] (NUTCH-2496) Speed up link inversion step in crawling script
[ https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Moreno Feltscher reassigned NUTCH-2496: --- Assignee: Lewis John McGibbney > Speed up link inversion step in crawling script > --- > > Key: NUTCH-2496 > URL: https://issues.apache.org/jira/browse/NUTCH-2496 > Project: Nutch > Issue Type: Improvement >Reporter: Moreno Feltscher >Assignee: Lewis John McGibbney
[jira] [Created] (NUTCH-2496) Speed up link inversion step in crawling script
Moreno Feltscher created NUTCH-2496: --- Summary: Speed up link inversion step in crawling script Key: NUTCH-2496 URL: https://issues.apache.org/jira/browse/NUTCH-2496 Project: Nutch Issue Type: Improvement Reporter: Moreno Feltscher While working on a project where I have to index a huge number of URLs I encountered an issue with the link inversion step of the crawling script. A while ago Ian Lopata stumbled upon the same issue as described here: http://lucene.472066.n3.nabble.com/InvertLinks-Performance-Nutch-1-6-td4183004.html {quote} I am running the invertlinks step in my Nutch 1.6 based crawl process on a single node. I run invertlinks only because I need the Inlinks in the indexer step so as to store them with the document. I do not need the anchor text and I am not scoring. I am finding that invertlinks (and more specifically the merge of the linkdb) takes a long time - about 30 minutes for a crawl of around 150K documents. I am looking for ways that I might shorten this processing time. Any suggestions? {quote} Back then [~wastl-nagel] suggested turning off the normalizers and filters during the inversion step, which speeds up the process considerably. In my case, however, I depend on those, so this is not a real solution. I opened this issue in order to get feedback on how we could improve the crawl script and speed up the process.
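For reference, the suggestion to turn off normalizers and filters during inversion corresponds to two switches on the `invertlinks` command. The sketch below only builds and prints the command line. The function name, its second parameter and the paths are hypothetical.

```shell
#!/usr/bin/env bash
# Builds (but does not run) an invertlinks command line.
# build_invertlinks_cmd and skip_url_processing are illustrative names.
build_invertlinks_cmd() {
  local crawl_path="$1" skip_url_processing="$2"
  local cmd="bin/nutch invertlinks $crawl_path/linkdb -dir $crawl_path/segments"

  # Skipping URL normalization and filtering is where the time is saved,
  # at the price of links not being re-checked against the current rules.
  if [ "$skip_url_processing" = "true" ]; then
    cmd="$cmd -noNormalize -noFilter"
  fi
  echo "$cmd"
}

build_invertlinks_cmd crawl true
build_invertlinks_cmd crawl false
```

Keeping the switch configurable lets setups that depend on filters and normalizers, like the one described above, leave them on.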
[jira] [Created] (NUTCH-2495) Use -deleteGone instead of clean job in crawler script while indexing
Moreno Feltscher created NUTCH-2495: --- Summary: Use -deleteGone instead of clean job in crawler script while indexing Key: NUTCH-2495 URL: https://issues.apache.org/jira/browse/NUTCH-2495 Project: Nutch Issue Type: Improvement Reporter: Moreno Feltscher Assignee: Moreno Feltscher Instead of running {{bin/nutch clean}} after indexing the documents, run {{bin/nutch index}} with the {{-deleteGone}} flag, which deletes not only gone and duplicated documents but also redirects from the index.
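The two variants can be compared side by side in a small dry run that prints the commands rather than executing them. The function name, the boolean parameter and the paths are hypothetical, and the argument order is an assumption to be verified against the `bin/nutch index` usage output.

```shell
#!/usr/bin/env bash
# Prints the indexing commands for either variant; nothing is executed.
# build_index_cmds and use_delete_gone are illustrative names.
build_index_cmds() {
  local crawl_path="$1" use_delete_gone="$2"
  local index_cmd="bin/nutch index $crawl_path/crawldb -linkdb $crawl_path/linkdb -dir $crawl_path/segments"

  if [ "$use_delete_gone" = "true" ]; then
    # Proposed: a single job that indexes and deletes in one pass.
    echo "$index_cmd -deleteGone"
  else
    # Current: index first, then run a separate clean job.
    echo "$index_cmd"
    echo "bin/nutch clean $crawl_path/crawldb"
  fi
}

build_index_cmds crawl true
```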
[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin
[ https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16323026#comment-16323026 ] Moreno Feltscher commented on NUTCH-1129: - [~lewismc]: Thanks for merging! A special thank you goes out to my amazing co-workers who did a great job on this :-) cc [~thilohaas] > Any23 Nutch plugin > -- > > Key: NUTCH-1129 > URL: https://issues.apache.org/jira/browse/NUTCH-1129 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-1129.patch > > > This plugin should build on the Any23 library to provide us with a plugin > which extracts RDF data from HTTP and file resources. Although as of writing > Any23 not part of the ASF, the project is working towards integration into > the Apache Incubator. Once the project proves its value, this would be an > excellent addition to the Nutch 1.X codebase. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347742#comment-16347742 ] Moreno Feltscher commented on NUTCH-2466: - I absolutely get your point and I'm 100% with you on this - forever is not a good idea in any scenario :-) I just wanted to make sure I understand this change correctly. FYI, Google Chrome treats 21 redirects as "too many", so I'm going to use 20 for {{sitemap.redir.max}} in my setup => https://stackoverflow.com/a/36041063/5884584 > Sitemap processor to follow redirects > - > > Key: NUTCH-2466 > URL: https://issues.apache.org/jira/browse/NUTCH-2466 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.13 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-2466.patch, NUTCH-2466.patch, NUTCH-2466.patch > > > It does follow http > https, but not the following redirect, e.g. > sitemap_index.xml that some websites have.
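The effect of a redirect limit such as {{sitemap.redir.max}} can be illustrated with a bounded loop. This is not the sitemap processor's implementation: `redirect_target` is a hypothetical stand-in for an HTTP fetch, and the example.org URLs are made up.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for an HTTP request: prints the redirect target of a
# URL, or nothing if the URL does not redirect.
redirect_target() {
  case "$1" in
    http://example.org/sitemap.xml)  echo "https://example.org/sitemap.xml" ;;
    https://example.org/sitemap.xml) echo "https://example.org/sitemap_index.xml" ;;
    *) ;;
  esac
}

# Follows redirects starting at $1, giving up after $2 hops.
resolve() {
  local url="$1" max_redirects="$2" hops=0 next
  while next="$(redirect_target "$url")" && [ -n "$next" ]; do
    if [ "$hops" -ge "$max_redirects" ]; then
      echo "too many redirects for $1" >&2
      return 1
    fi
    url="$next"
    hops=$((hops + 1))
  done
  echo "$url"
}

resolve http://example.org/sitemap.xml 20
```

A finite limit guarantees termination even when two URLs redirect to each other, which following redirects forever would not.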
[jira] [Created] (NUTCH-2508) Misleading documentation about http.proxy.exception.list
Moreno Feltscher created NUTCH-2508: --- Summary: Misleading documentation about http.proxy.exception.list Key: NUTCH-2508 URL: https://issues.apache.org/jira/browse/NUTCH-2508 Project: Nutch Issue Type: Bug Reporter: Moreno Feltscher Assignee: Moreno Feltscher The description of {{http.proxy.exception.list}} states that domains as well as URLs can be excluded from being routed through a pre-configured proxy. This is misleading, since only hosts are checked when using this feature.
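Why a full URL in the list never matches can be shown with a small host-only check. This is a simplified illustration, not the protocol plugin's code. The function names, the comma-separated format and the example hosts are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical exception list, assumed to be comma-separated.
PROXY_EXCEPTIONS="intranet.example.org,localhost"

# Extracts the host part of a URL.
host_of() {
  local rest="${1#*://}"   # drop the scheme
  rest="${rest%%/*}"       # drop path and query
  echo "${rest%%:*}"       # drop the port
}

# Succeeds if the host of the URL is on the exception list.
bypasses_proxy() {
  local host entry
  host="$(host_of "$1")"
  local IFS=','
  for entry in $PROXY_EXCEPTIONS; do
    [ "$entry" = "$host" ] && return 0
  done
  return 1
}

bypasses_proxy "http://intranet.example.org:8080/docs/page.html" && echo "direct connection"
```

Because only the host is compared, an entry such as `http://intranet.example.org/docs` would never match any request.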
[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347718#comment-16347718 ] Moreno Feltscher commented on NUTCH-2466: - Is there any way to configure this so that nutch follows redirects forever (which was the case before this patch)? > Sitemap processor to follow redirects > - > > Key: NUTCH-2466 > URL: https://issues.apache.org/jira/browse/NUTCH-2466 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.13 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.15 > > Attachments: NUTCH-2466.patch, NUTCH-2466.patch, NUTCH-2466.patch > > > It does follow http > https, but not the following redirect, e.g. > sitemap_index.xml that some websites have. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2490) Sitemap processing: Sitemap index files not working
[ https://issues.apache.org/jira/browse/NUTCH-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Moreno Feltscher updated NUTCH-2490: Description: The [sitemap processing feature|https://wiki.apache.org/nutch/SitemapFeature] does not properly handle sitemap index files due to an unnecessary conditional. (was: The [sitemap processing feature](https://wiki.apache.org/nutch/SitemapFeature) does not properly handle sitemap index files due to an unnecessary conditional.) > Sitemap processing: Sitemap index files not working > --- > > Key: NUTCH-2490 > URL: https://issues.apache.org/jira/browse/NUTCH-2490 > Project: Nutch > Issue Type: Bug >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher
[jira] [Created] (NUTCH-2492) Add more configuration parameters to crawl script
Moreno Feltscher created NUTCH-2492: --- Summary: Add more configuration parameters to crawl script Key: NUTCH-2492 URL: https://issues.apache.org/jira/browse/NUTCH-2492 Project: Nutch Issue Type: New Feature Reporter: Moreno Feltscher Assignee: Moreno Feltscher Instead of requiring users to copy and adjust the crawl script in order to set the following configuration options, allow them to be passed in as arguments: - numSlaves - numTasks - sizeFetchlist - timeLimitFetch - numThreads
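Passing these options as arguments could be handled with a plain `while`/`case` loop, as in the sketch below. The option names (`--num-slaves` and so on) and the default values are placeholders chosen for illustration, not the names or defaults the crawl script uses.

```shell
#!/usr/bin/env bash
# Placeholder defaults; the real crawl script may use different values.
NUM_SLAVES=1
NUM_TASKS=2
SIZE_FETCHLIST=50000
TIME_LIMIT_FETCH=180
NUM_THREADS=50

# Parses "--option value" pairs; the option names are hypothetical.
parse_args() {
  while [ "$#" -gt 0 ]; do
    # Every option takes a value, so a lone trailing option is an error.
    if [ "$#" -lt 2 ]; then
      echo "missing value for $1" >&2
      return 1
    fi
    case "$1" in
      --num-slaves)       NUM_SLAVES="$2" ;;
      --num-tasks)        NUM_TASKS="$2" ;;
      --size-fetchlist)   SIZE_FETCHLIST="$2" ;;
      --time-limit-fetch) TIME_LIMIT_FETCH="$2" ;;
      --num-threads)      NUM_THREADS="$2" ;;
      *) echo "unknown option: $1" >&2; return 1 ;;
    esac
    shift 2
  done
}

parse_args --num-slaves 4 --num-threads 20
echo "numSlaves=$NUM_SLAVES numTasks=$NUM_TASKS numThreads=$NUM_THREADS"
```

Options that are not passed keep their defaults, so existing invocations of the script would continue to work unchanged.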
[jira] [Created] (NUTCH-2490) Sitemap processing: Sitemap index files not working
Moreno Feltscher created NUTCH-2490: --- Summary: Sitemap processing: Sitemap index files not working Key: NUTCH-2490 URL: https://issues.apache.org/jira/browse/NUTCH-2490 Project: Nutch Issue Type: Bug Reporter: Moreno Feltscher Assignee: Moreno Feltscher The [sitemap processing feature](https://wiki.apache.org/nutch/SitemapFeature) does not properly handle sitemap index files due to an unnecessary conditional.
[jira] [Created] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
Moreno Feltscher created NUTCH-2501: --- Summary: Take into account $NUTCH_HEAPSIZE when crawling using crawl script Key: NUTCH-2501 URL: https://issues.apache.org/jira/browse/NUTCH-2501 Project: Nutch Issue Type: Improvement Reporter: Moreno Feltscher Assignee: Moreno Feltscher -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (NUTCH-2502) Any23 Plugin: Add Content-Type filtering
[ https://issues.apache.org/jira/browse/NUTCH-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Moreno Feltscher reassigned NUTCH-2502: --- Assignee: Lewis John McGibbney (was: Moreno Feltscher) > Any23 Plugin: Add Content-Type filtering > > > Key: NUTCH-2502 > URL: https://issues.apache.org/jira/browse/NUTCH-2502 > Project: Nutch > Issue Type: Improvement >Reporter: Moreno Feltscher >Assignee: Lewis John McGibbney >Priority: Major > > It should be possible to filter based on a document's Content-Type when using > Any23 extractors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Moreno Feltscher reassigned NUTCH-2501: --- Assignee: Lewis John McGibbney (was: Moreno Feltscher) > Take into account $NUTCH_HEAPSIZE when crawling using crawl script > -- > > Key: NUTCH-2501 > URL: https://issues.apache.org/jira/browse/NUTCH-2501 > Project: Nutch > Issue Type: Improvement >Reporter: Moreno Feltscher >Assignee: Lewis John McGibbney >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (NUTCH-2495) Use -deleteGone instead of clean job in crawler script while indexing
[ https://issues.apache.org/jira/browse/NUTCH-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Moreno Feltscher reassigned NUTCH-2495: --- Assignee: Lewis John McGibbney (was: Moreno Feltscher) > Use -deleteGone instead of clean job in crawler script while indexing > - > > Key: NUTCH-2495 > URL: https://issues.apache.org/jira/browse/NUTCH-2495 > Project: Nutch > Issue Type: Improvement >Reporter: Moreno Feltscher >Assignee: Lewis John McGibbney >Priority: Major
[jira] [Assigned] (NUTCH-2499) Elastic REST Indexer: Duplicate values
[ https://issues.apache.org/jira/browse/NUTCH-2499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Moreno Feltscher reassigned NUTCH-2499: --- Assignee: Lewis John McGibbney (was: Moreno Feltscher) > Elastic REST Indexer: Duplicate values > -- > > Key: NUTCH-2499 > URL: https://issues.apache.org/jira/browse/NUTCH-2499 > Project: Nutch > Issue Type: Bug >Reporter: Moreno Feltscher >Assignee: Lewis John McGibbney >Priority: Major > > Due to a change in > https://github.com/apache/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e#diff-408fd2f17bc9791dcbf531ffe6574a6a > the Elastic REST indexer no longer uses HashSets for values and instead > saves duplicated values as arrays. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
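To illustrate the behaviour described in NUTCH-2499, here is a minimal sketch of how duplicated field values could be collapsed before a document is written. It is not taken from the Nutch code base: the class FieldValueDedup and its method distinct are hypothetical names, and the sketch only assumes that a multi-valued field arrives as a list of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper, not part of Nutch: collapses repeated field values
// before a document is handed to an index writer.
final class FieldValueDedup {

  // Returns the distinct values in first-seen order.
  static List<String> distinct(List<String> values) {
    // LinkedHashSet drops duplicates while keeping insertion order.
    return new ArrayList<>(new LinkedHashSet<>(values));
  }

  public static void main(String[] args) {
    List<String> raw = Arrays.asList("en", "de", "en", "fr", "de");
    System.out.println(distinct(raw));
  }
}
```

A LinkedHashSet is used here rather than a plain HashSet so that the order of the surviving values stays predictable; whether ordering matters for a given index mapping is an open question.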
[jira] [Created] (NUTCH-2503) Add option to run tests for a single plugin
Moreno Feltscher created NUTCH-2503: --- Summary: Add option to run tests for a single plugin Key: NUTCH-2503 URL: https://issues.apache.org/jira/browse/NUTCH-2503 Project: Nutch Issue Type: Improvement Reporter: Moreno Feltscher Assignee: Moreno Feltscher Sometimes it makes sense to just run tests for a single plugin instead of building all plugins and running all tests at once. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335999#comment-16335999 ] Moreno Feltscher commented on NUTCH-2501: - Pull request: https://github.com/apache/nutch/pull/279 > Take into account $NUTCH_HEAPSIZE when crawling using crawl script > -- > > Key: NUTCH-2501 > URL: https://issues.apache.org/jira/browse/NUTCH-2501 > Project: Nutch > Issue Type: Improvement >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2503) Add option to run tests for a single plugin
[ https://issues.apache.org/jira/browse/NUTCH-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335991#comment-16335991 ] Moreno Feltscher commented on NUTCH-2503: - Pull request: https://github.com/apache/nutch/pull/281 > Add option to run tests for a single plugin > --- > > Key: NUTCH-2503 > URL: https://issues.apache.org/jira/browse/NUTCH-2503 > Project: Nutch > Issue Type: Improvement >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher >Priority: Major > > Sometimes it makes sense to just run tests for a single plugin instead of > building all plugins and running all tests at once. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2502) Any23 Plugin: Add Content-Type filtering
[ https://issues.apache.org/jira/browse/NUTCH-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335994#comment-16335994 ] Moreno Feltscher commented on NUTCH-2502: - Pull request: https://github.com/apache/nutch/pull/280 > Any23 Plugin: Add Content-Type filtering > > > Key: NUTCH-2502 > URL: https://issues.apache.org/jira/browse/NUTCH-2502 > Project: Nutch > Issue Type: Improvement >Reporter: Moreno Feltscher >Assignee: Moreno Feltscher >Priority: Major > > It should be possible to filter based on a document's Content-Type when using > Any23 extractors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
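The kind of Content-Type filtering requested in NUTCH-2502 might look roughly like the sketch below. It does not reflect the code in the linked pull request: the class ContentTypeFilter, its constructor, and the methods normalize and accepts are hypothetical, and the sketch assumes the filter is given an allow-list of media types and a raw Content-Type header value.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical allow-list filter, not the Any23 plugin's actual implementation.
final class ContentTypeFilter {
  private final Set<String> allowed = new HashSet<>();

  ContentTypeFilter(String... allowedTypes) {
    for (String type : allowedTypes) {
      allowed.add(normalize(type));
    }
  }

  // Strips parameters such as "; charset=UTF-8" and lower-cases the media type.
  static String normalize(String contentType) {
    if (contentType == null) {
      return "";
    }
    int semi = contentType.indexOf(';');
    String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
    return base.trim().toLowerCase(Locale.ROOT);
  }

  // True if the document's media type is on the allow-list.
  boolean accepts(String contentType) {
    return allowed.contains(normalize(contentType));
  }
}
```

With an instance built as new ContentTypeFilter("text/html", "application/xhtml+xml"), a header value of "Text/HTML; charset=UTF-8" would be accepted and "application/pdf" would be skipped, so extractors would only run on the listed types.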
[jira] [Commented] (NUTCH-2755) Remove obsolete plugin indexer-elastic-rest
[ https://issues.apache.org/jira/browse/NUTCH-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129892#comment-17129892 ] Moreno Feltscher commented on NUTCH-2755: - [~snagel]: Is there an example of how to use the document routing feature to store documents in different indices based on their language? > Remove obsolete plugin indexer-elastic-rest > --- > > Key: NUTCH-2755 > URL: https://issues.apache.org/jira/browse/NUTCH-2755 > Project: Nutch > Issue Type: Improvement > Components: indexer, plugin >Affects Versions: 1.17 >Reporter: Sebastian Nagel >Assignee: Shashanka Balakuntala Srinivasa >Priority: Major > Fix For: 1.17 > > > With NUTCH-2739 the plugin indexer-elastic uses the [REST > client|https://www.elastic.co/guide/en/elasticsearch/client/java-rest/7.3/java-rest-high.html] > instead of the deprecated > [TransportClient|https://www.elastic.co/guide/en/elasticsearch/client/java-api/7.3/transport-client.html]. > This obsoletes the separate REST-based plugin indexer-elastic-rest. -- This message was sent by Atlassian Jira (v8.3.4#803005)