[jira] [Resolved] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil resolved NUTCH-1284.
--------------------------------
    Resolution: Fixed

Add site fetcher.max.crawl.delay as log output by default.
----------------------------------------------------------

                 Key: NUTCH-1284
                 URL: https://issues.apache.org/jira/browse/NUTCH-1284
             Project: Nutch
          Issue Type: New Feature
          Components: fetcher
    Affects Versions: nutchgora, 1.5
            Reporter: Lewis John McGibbney
            Assignee: Tejas Patil
            Priority: Trivial
             Fix For: 1.7, 2.2
         Attachments: NUTCH-1284-2.x.v1.patch, NUTCH-1284.patch, NUTCH-1284-trunk.v1.patch

Currently, when manually scanning our log output we cannot infer which pages are governed by a crawl delay between successive fetch attempts of any given page within the site. The value should be made available as something like:

{code}
2012-02-19 12:33:33,031 INFO fetcher.Fetcher - fetching http://nutch.apache.org/ (crawl.delay=XXXms)
{code}

This way we can easily and quickly determine whether the fetcher is having to use this functionality or not.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
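The requested behavior can be sketched as a small helper (hypothetical class and method names, not the committed patch) that appends the delay to the fetch message only when one applies:

```java
// Hypothetical sketch of the proposed log line for NUTCH-1284;
// class and method names are invented for illustration.
public class FetchLogSketch {

    // Builds the fetch log message, appending the crawl delay only
    // when robots.txt actually imposes one (delay > 0).
    static String fetchMessage(String url, long crawlDelayMs) {
        StringBuilder sb = new StringBuilder("fetching ").append(url);
        if (crawlDelayMs > 0) {
            sb.append(" (crawl.delay=").append(crawlDelayMs).append("ms)");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // With a 5-second Crawl-Delay the suffix appears; without one it does not.
        System.out.println(fetchMessage("http://nutch.apache.org/", 5000));
        System.out.println(fetchMessage("http://example.org/", 0));
    }
}
```

This makes it trivial to grep the fetcher log for `crawl.delay=` to see which hosts are throttling the crawl.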
[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564107#comment-13564107 ]

Tejas Patil commented on NUTCH-1284:
------------------------------------

Committed @revision 1439289 in trunk
Committed @revision 1439291 in 2.x
[jira] [Resolved] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil resolved NUTCH-1042.
--------------------------------
    Resolution: Fixed

The fix for NUTCH-1284 takes care of this.

Fetcher.max.crawl.delay property not taken into account correctly when set to -1
--------------------------------------------------------------------------------

                 Key: NUTCH-1042
                 URL: https://issues.apache.org/jira/browse/NUTCH-1042
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.3
            Reporter: Nutch User - 1
            Assignee: Lewis John McGibbney
             Fix For: 1.7, 2.2

[Originally: http://lucene.472066.n3.nabble.com/A-possible-bug-or-misleading-documentation-td3162397.html ]

From nutch-default.xml:

{code}
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description>If the Crawl-Delay in robots.txt is set to greater than this value (in seconds)
  then the fetcher will skip this page, generating an error report. If set to -1 the fetcher
  will never skip such pages and will wait the amount of time retrieved from robots.txt
  Crawl-Delay, however long that might be.</description>
</property>
{code}

Fetcher.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup), line 554:

{code}
this.maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30) * 1000;
{code}

Lines 615-616:

{code}
if (rules.getCrawlDelay() > 0) {
  if (rules.getCrawlDelay() > maxCrawlDelay) {
{code}

Now, the documentation states that, if fetcher.max.crawl.delay is set to -1, the crawler will always wait the amount of time the Crawl-Delay parameter specifies. However, as you can see, if it really is negative the condition on line 616 is always true, which leads to skipping the page whose Crawl-Delay is set.
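The arithmetic behind the report: with fetcher.max.crawl.delay = -1, maxCrawlDelay becomes -1000 ms, so any positive robots.txt Crawl-Delay exceeds it. A standalone sketch (hypothetical helper names, not Fetcher.java itself) contrasting the buggy check with a fix that treats a negative limit as "never skip":

```java
// Hypothetical standalone sketch of the NUTCH-1042 condition; the method
// names are invented, only the comparison logic mirrors the report.
public class MaxCrawlDelaySketch {

    // Buggy check as described: a negative maxCrawlDelay makes the
    // second comparison true for every positive Crawl-Delay.
    static boolean skipBuggy(long crawlDelayMs, long maxCrawlDelayMs) {
        return crawlDelayMs > 0 && crawlDelayMs > maxCrawlDelayMs;
    }

    // Fixed check: a negative limit means "never skip, wait instead",
    // matching the nutch-default.xml documentation.
    static boolean skipFixed(long crawlDelayMs, long maxCrawlDelayMs) {
        return maxCrawlDelayMs >= 0
                && crawlDelayMs > 0
                && crawlDelayMs > maxCrawlDelayMs;
    }

    public static void main(String[] args) {
        long max = -1 * 1000;       // fetcher.max.crawl.delay = -1, times 1000
        long robotsDelay = 10_000;  // robots.txt Crawl-Delay: 10 (seconds)
        System.out.println(skipBuggy(robotsDelay, max)); // page wrongly skipped
        System.out.println(skipFixed(robotsDelay, max)); // fetcher waits instead
    }
}
```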
[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564110#comment-13564110 ]

Hudson commented on NUTCH-1284:
-------------------------------

Integrated in Nutch-trunk-Windows #18 (See [https://builds.apache.org/job/Nutch-trunk-Windows/18/])
NUTCH-1284 Add site fetcher.max.crawl.delay as log output by default (Revision 1439289)

Result = FAILURE
tejasp : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1439289
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564111#comment-13564111 ]

Hudson commented on NUTCH-1284:
-------------------------------

Integrated in Nutch-2.x-Windows #18 (See [https://builds.apache.org/job/Nutch-2.x-Windows/18/])
NUTCH-1284 Add site fetcher.max.crawl.delay as log output by default (Revision 1439291)

Result = FAILURE
tejasp : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1439291
Files :
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java
[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564113#comment-13564113 ]

Hudson commented on NUTCH-1284:
-------------------------------

Integrated in Nutch-nutchgora #478 (See [https://builds.apache.org/job/Nutch-nutchgora/478/])
NUTCH-1284 Add site fetcher.max.crawl.delay as log output by default (Revision 1439291)

Result = FAILURE
tejasp : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1439291
Files :
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java
Build failed in Jenkins: Nutch-nutchgora #478
See https://builds.apache.org/job/Nutch-nutchgora/478/changes

Changes:

[tejasp] NUTCH-1284 Add site fetcher.max.crawl.delay as log output by default

------------------------------------------
[...truncated 3497 lines...]
init-plugin:
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/ivy/ivysettings.xml
compile:
     [echo] Compiling plugin: protocol-file
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
jar:
deps-test:
deploy:
copy-generated-lib:
deploy:
copy-generated-lib:
test:
     [echo] Testing plugin: parse-js
    [junit] Running org.apache.nutch.parse.js.TestJSParseFilter
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.407 sec
init:
init-plugin:
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/ivy/ivysettings.xml
compile:
     [echo] Compiling plugin: index-anchor
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
compile-test:
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/src/plugin/build-plugin.xml:180: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/index-anchor/test
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning
jar:
deps-test:
deploy:
copy-generated-lib:
test:
     [echo] Testing plugin: index-anchor
    [junit] Running org.apache.nutch.indexer.anchor.TestAnchorIndexingFilter
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.733 sec
init:
init-plugin:
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/ivy/ivysettings.xml
compile:
     [echo] Compiling plugin: index-basic
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
compile-test:
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/src/plugin/build-plugin.xml:180: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/index-basic/test
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning
jar:
deps-test:
deploy:
copy-generated-lib:
test:
     [echo] Testing plugin: index-basic
    [junit] Running org.apache.nutch.indexer.basic.TestBasicIndexingFilter
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.989 sec
init:
init-plugin:
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/ivy/ivysettings.xml
compile:
     [echo] Compiling plugin: index-more
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
compile-test:
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/src/plugin/build-plugin.xml:180: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/index-more/test
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning
jar:
deps-test:
deploy:
copy-generated-lib:
test:
     [echo] Testing plugin: index-more
    [junit] Running org.apache.nutch.indexer.more.TestMoreIndexingFilter
    [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1.307 sec
init:
init-plugin:
     [echo] Copying language profiles
     [echo] Copying test files
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file =
[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564114#comment-13564114 ]

Hudson commented on NUTCH-1284:
-------------------------------

Integrated in Nutch-trunk #2103 (See [https://builds.apache.org/job/Nutch-trunk/2103/])
NUTCH-1284 Add site fetcher.max.crawl.delay as log output by default (Revision 1439289)

Result = SUCCESS
tejasp : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1439289
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
[jira] [Assigned] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil reassigned NUTCH-1465:
----------------------------------
    Assignee: Tejas Patil

Support sitemaps in Nutch
-------------------------

                 Key: NUTCH-1465
                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
             Project: Nutch
          Issue Type: New Feature
          Components: parser
            Reporter: Lewis John McGibbney
            Assignee: Tejas Patil
             Fix For: 1.7
         Attachments: NUTCH-1465-trunk.v1.patch

I recently came across this rather stagnant codebase [0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps, as per the discussion here [1].

[0] http://sourceforge.net/projects/sitemap-parser/
[1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564173#comment-13564173 ]

Julien Nioche commented on NUTCH-1047:
--------------------------------------

@tejasp I can reproduce the issue and am looking into it, thanks. Somehow the configuration does not get passed on properly when using the crawl command.

@lufeng Thanks.

{quote}But I don't know why not add an option to set IndexerUrl, such as bin/nutch solrindex -indexurl http://localhost:8983/solr/.{quote}

Whether it is passed as a parameter or via the configuration should not make much of a difference. Your suggestion also assumes that the indexing backend can be reached via a single URL, which is not necessarily the case: it might not need a URL at all or, at the opposite extreme, need multiple URLs. Better to leave that logic in the configuration and assume that the backends will find whatever they need there.

{quote}The correct command to invoke the IndexingJob is bin/nutch solrindex http://localhost:8983/solr/ crawldb/ segments/20130121115214/ -filter.{quote}

As explained above, we want to keep compatibility with the existing solrindex command and not change its syntax. Underneath it uses the new plugin-based code but sets the value of the solr config. There is no shortcut for the generic indexing job command in the nutch script yet, but we could add one. For now it has to be called in full, e.g. bin/nutch org.apache.nutch.indexer.IndexingJob ..., which will make sense when we have other indexing backends and not just SOLR. Think of 'nutch solrindex' as a shortcut for the generic command.
Pluggable indexing backends
---------------------------

                 Key: NUTCH-1047
                 URL: https://issues.apache.org/jira/browse/NUTCH-1047
             Project: Nutch
          Issue Type: New Feature
          Components: indexer
            Reporter: Julien Nioche
            Assignee: Julien Nioche
              Labels: indexing
             Fix For: 1.7
         Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch

One possible feature would be to add a new endpoint for indexing-backends and make the indexing pluggable. At the moment we are hardwired to SOLR - which is OK - but as other resources like ElasticSearch are becoming more popular it would be better to handle this as plugins. Not sure about the name of the endpoint though: we already have indexing-plugins (which are about generating the fields sent to the backends), and moreover the backends are not necessarily for indexing / searching but could be just an external storage, e.g. CouchDB. The term "backend" on its own would be confusing in 2.0, as this could be pertaining to the storage in GORA. 'indexing-backend' is the best name that came to my mind so far - please suggest better ones. We should come up with generic map/reduce jobs for indexing, deduplicating and cleaning, and maybe add a Nutch extension point there so we can easily hook up indexing, cleaning and deduplicating for various backends.
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564187#comment-13564187 ]

Tejas Patil commented on NUTCH-1047:
------------------------------------

Hi Julien,
After the reply from @lufeng, I was able to perform indexing with the crawl command. Here is a summary of what I observed:

||solr.server.url in nutch-site.xml||-D in crawl command||Works?||
|no|no|RuntimeException: Missing SOLR URL|
|no|yes|yes|
|yes|no|yes|
|yes|yes|yes|

Note that I had to pass -solr and the solr url every time; otherwise it didn't invoke indexing.
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564196#comment-13564196 ]

Julien Nioche commented on NUTCH-1047:
--------------------------------------

Hi Tejas,
It will work every time you set it in nutch-site.xml. As for setting it with -D in the crawl command - you definitely should not have to do that, and this is where the bug is. The problem is that, for some reason, the value we take from the crawl command is correctly set in the configuration object; however, the latter is reloaded or overridden during the call to JobClient.runJob(job) (IndexingJob line 120). BTW the crawl command is deprecated and should be removed at some point, as we have the crawl script. Could you try using the solrindex command as well as the crawl script while I try to solve the problem with the crawl command?
Thanks
Julien
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564252#comment-13564252 ]

Tejas Patil commented on NUTCH-1047:
------------------------------------

Hi Julien,
The solrindex command and the crawl script work fine after setting solr.server.url in nutch-site.xml. I did not use the -D option during these runs.
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564263#comment-13564263 ]

Julien Nioche commented on NUTCH-1047:
--------------------------------------

Tejas,
The crawl script and the solrindex command should work without setting solr.server.url in nutch-site.xml or using -D, as this is handled for you in the nutch script. Can you please test without specifying solr.server.url in nutch-site.xml?
Thanks
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564274#comment-13564274 ]

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

Hi Tejas, thanks! A few comments on the patch:

??for a given host, sitemaps are processed just once??
But they are not cached over cycles because the cache is bound to the protocol object. Is this correct? So a sitemap is fetched and processed every cycle for every host? If yes, and sitemaps are large (they can be!), this would cause a lot of extra traffic. Shouldn't sitemap URLs be handled the same way as any other URL: add them to the CrawlDb, fetch and parse once, add found links to the CrawlDb, cf. [Ken's post at CC|https://groups.google.com/forum/?fromgroups#!topic/crawler-commons/DrAX4Th1A4I].

There are some complications:
- due to their size, sitemaps may require larger values regarding size and time limits
- sitemaps may require more frequent re-fetching (e.g. by MimeAdaptiveFetchSchedule)
- the current Outlink class cannot hold the extra information contained in sitemaps (lastmod, changefreq, etc.)

There is another way, which we use for several customers: a SitemapInjector fetches the sitemaps, extracts the URLs and injects them with all extra information. It's a simple use case for a customized site-search: there is a sitemap and it shall be used as the seed list or even the exclusive list of documents to be crawled. Is there any interest in this solution? It's not a general solution and not adaptable to a large web crawl.
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564768#comment-13564768 ]

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

Yes, SitemapInjector is a map-reduce job. The scenario for its use is the following:
- a small set of sites to be crawled (eg, to feed a site-search index)
- you can think of sitemaps as remote seed lists. Because many content management systems can generate sitemaps, it is convenient for the site owners to publish seeds. The URLs contained in the sitemap can also be the complete and exclusive set of URLs to be crawled (you can use the plugin scoring-depth to limit the crawl to seed URLs).
- because you can trust the sitemap's content:
-* checks for cross submissions are not necessary
-* extra information (lastmod, changefreq, priority) can be used

That's how we use sitemaps: remote seed lists, maintained by customers, quite convenient if you run a crawler as a service.

For large web crawls there is also another aspect: detection of sitemaps, which is bound to the processing of robots.txt. Processing of sitemaps can (and should?) be done the usual Nutch way:
- detection is done in the protocol plugin (see Tejas' patch)
- record in CrawlDb: done by Fetcher (cross-submission information can be added)
- fetch (if not yet done), parse (a plugin parse-sitemap based on crawler-commons?) and extract outlinks: sitemaps may require special treatment here because they can be large in size and usually contain many outlinks. Also, the Outlink class needs to be extended to deal with the extra info relevant for scheduling.

To use an extra tool (such as the SitemapInjector) for processing the sitemaps has the disadvantage that we first must get all sitemap URLs out of the CrawlDb. On the contrary, special treatment can easily be realized in a separate map-reduce job. Comments?!

Thanks, Tejas: the feature is moving forward thanks to your initiative!
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564827#comment-13564827 ] Sebastian Nagel commented on NUTCH-1047: As a test for the interface I started to implement a CSV indexer - useful for exporting crawled data or for quick analysis. A first working version (draft, still a lot to do) fits within 100+ lines of code. +1 for the interface / extension point. Some concerns about the usability of IndexingJob as a daily tool:
- it is not really transparent which indexer is run (Solr, Elasticsearch, etc.): you have to look into the property plugin.includes
- options must be passed to indexer plugins as properties: complicated, and there is no help to get a list of available properties
Pluggable indexing backends --- Key: NUTCH-1047 URL: https://issues.apache.org/jira/browse/NUTCH-1047 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Julien Nioche Assignee: Julien Nioche Labels: indexing Fix For: 1.7 Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch One possible feature would be to add a new endpoint for indexing-backends and make the indexing pluggable. At the moment we are hardwired to SOLR - which is OK - but as other resources like ElasticSearch are becoming more popular it would be better to handle this as plugins. Not sure about the name of the endpoint though: we already have indexing-plugins (which are about generating fields sent to the backends) and moreover the backends are not necessarily for indexing / searching but could be just external storage, e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this could be pertaining to the storage in GORA. 'indexing-backend' is the best name that came to my mind so far - please suggest better ones.
We should come up with generic map/reduce jobs for indexing, deduplicating and cleaning and maybe add a Nutch extension point there so we can easily hook up indexing, cleaning and deduplicating for various backends.
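The extension-point idea can be illustrated with a toy "indexing backend" contract: each backend (Solr, Elasticsearch, a CSV export, ...) implements the same open/write/close interface and is selected by configuration instead of being hardwired. A minimal Python sketch (illustration only; the real Nutch extension point, class names and configuration mechanism differ):

```python
import csv
import io

class IndexWriter:
    """Toy backend contract: open with a config, write documents, close."""
    def open(self, conf): ...
    def write(self, doc): ...
    def close(self): ...

class CSVIndexWriter(IndexWriter):
    """Export documents as CSV rows -- handy for quick analysis of a crawl."""
    def open(self, conf):
        self.fields = conf["fields"]
        self.buf = io.StringIO()
        self.writer = csv.writer(self.buf)
        self.writer.writerow(self.fields)  # header row

    def write(self, doc):
        self.writer.writerow([doc.get(f, "") for f in self.fields])

    def close(self):
        return self.buf.getvalue()

# A generic indexing job looks a backend up by name instead of
# depending on one search engine directly.
BACKENDS = {"csv": CSVIndexWriter}

def run_indexing_job(backend_name, conf, docs):
    writer = BACKENDS[backend_name]()
    writer.open(conf)
    for doc in docs:
        writer.write(doc)
    return writer.close()

out = run_indexing_job("csv", {"fields": ["url", "title"]},
                       [{"url": "http://nutch.apache.org/", "title": "Nutch"}])
print(out)
```

This also shows the usability concern raised above: which backend runs, and which options it accepts, are visible only in the configuration passed in, not on the command line.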
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564836#comment-13564836 ] Markus Jelsma commented on NUTCH-1465: -- Thanks all for your interesting comments. It's a complicated issue. On the one hand, host data should be stored in NUTCH-1325, but that would require additional logic and sending each segment's output to the HostDb in case a sitemap was crawled. On the other hand, it is ideal to store host data: it is also easy to use in jobs such as the indexer and generator. I don't yet favour a specific approach, but storing sitemap data in a HostDb may be something to think about. Cheers
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564883#comment-13564883 ] Tejas Patil commented on NUTCH-1465: Hi Sebastian, So we are looking at 2 things here:
- a standalone utility for injecting sitemaps into the crawldb:
1. The user starts off with URLs of sitemap pages
2. SitemapInjector fetches these seeds and parses them (with a parse plugin based on crawler-commons)
3. SitemapInjector updates the crawldb with the sitemap entries.
- handling of sitemaps within the Nutch cycle (fetch, parse and update phases):
1. Robots parsing will populate a table of host -> list of links to sitemap pages
2. These will be added to the fetcher queue and will be fetched
3. A parser plugin based on crawler-commons will parse the sitemap page
4. The Outlink class needs to be extended to store the meta obtained from the sitemap
5. Write this into the segment
6. The update phase needs to update the crawl frequency of already existing URLs in the crawldb based on what we got from the sitemap, or else just add new entries to the crawldb.
I am not clear about the extending-Outlink part. The normal outlink extraction need not be done, as crawler-commons will already do that for us. The sitemap parser plugin must do this and create objects of our specialized sitemap link. While writing, where is the CrawlDatum generated from the outlink? The mime type that we get is text/xml, which can also mean a normal XML file. How will Nutch identify that it's a sitemap page and invoke the correct parser plugin? (I know that this magic is done by the feed parser, but I am not sure which part of the code is doing that. Just point me to that code.)
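On the text/xml ambiguity raised above: one cheap way to tell a sitemap apart from an arbitrary XML file is to sniff the document root, since a sitemap's root element is always urlset or sitemapindex in the sitemaps.org namespace. A simplified sketch in Python (illustration only; Nutch's actual parser dispatch goes through mime types and plugin mappings, not this function):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def looks_like_sitemap(xml_text):
    """Content sniffing: text/xml alone is ambiguous, but a sitemap's root
    element is <urlset> or <sitemapindex> in the sitemaps.org namespace."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    # ElementTree renders namespaced tags as "{namespace}localname".
    return root.tag in ("{%s}urlset" % SITEMAP_NS,
                        "{%s}sitemapindex" % SITEMAP_NS)

sitemap = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"/>'
plain = '<rss version="2.0"/>'
print(looks_like_sitemap(sitemap), looks_like_sitemap(plain))
```

A robust detector would also consider the URL path (robots.txt Sitemap: directives point at the file directly), so the namespace check is a fallback rather than the primary signal.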
[jira] [Commented] (NUTCH-945) Indexing to multiple SOLR Servers
[ https://issues.apache.org/jira/browse/NUTCH-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564953#comment-13564953 ] Alexander Kingson commented on NUTCH-945: - I see that the issue is unresolved. Is this patch working? Indexing to multiple SOLR Servers - Key: NUTCH-945 URL: https://issues.apache.org/jira/browse/NUTCH-945 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.2 Reporter: Charan Malemarpuram Fix For: 2.2 Attachments: MurmurHashPartitioner.java, NonPartitioningPartitioner.java, patch-NUTCH-945.txt It would be nice to have a default indexer in Nutch which can submit docs to multiple SOLR servers. Partitioning is always the question when writing to multiple SOLR servers. Default partitioning can be a simple hashcode-based distribution with additional hooks for customization.
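The hashcode-based distribution mentioned in the issue can be sketched in a few lines: hash each document's key and take it modulo the number of servers, so the same URL always lands on the same Solr instance. A Python sketch (illustration only; the attached MurmurHashPartitioner uses a different hash function, and the server URLs here are made up):

```python
import zlib

SERVERS = ["http://solr1:8983/solr", "http://solr2:8983/solr",
           "http://solr3:8983/solr"]

def partition(doc_url, num_shards):
    """Stable hashcode-based partitioning: the same URL always maps to the
    same shard (crc32 is deterministic, unlike Python's salted str hash())."""
    return zlib.crc32(doc_url.encode("utf-8")) % num_shards

def server_for(doc_url):
    return SERVERS[partition(doc_url, len(SERVERS))]

for u in ["http://nutch.apache.org/", "http://lucene.apache.org/",
          "http://www.apache.org/"]:
    print(u, "->", server_for(u))
```

The "additional hooks" would let users swap this function out, e.g. for a NonPartitioningPartitioner that broadcasts every document to all servers.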
Jenkins build is back to normal : Nutch-nutchgora #479
See https://builds.apache.org/job/Nutch-nutchgora/479/