[jira] Commented: (NUTCH-753) Prevent new Fetcher to retrieve the robots twice
[ https://issues.apache.org/jira/browse/NUTCH-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783235#action_12783235 ] Hudson commented on NUTCH-753: -- Integrated in Nutch-trunk #995 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/995/]) Prevent new Fetcher from retrieving the robots twice.

Prevent new Fetcher to retrieve the robots twice Key: NUTCH-753 URL: https://issues.apache.org/jira/browse/NUTCH-753 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-753.patch

The new Fetcher, which is now used by default, handles the robots file directly instead of relying on the protocol. The options Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS are set to false to prevent fetching the robots.txt twice (in the Fetcher and in the protocol), which avoids calling robots.isAllowed. However, in practice the robots file is still fetched, as there is a call to robots.getCrawlDelay() a bit further on which is not covered by the if (Protocol.CHECK_ROBOTS) check.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
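The report above boils down to one call sitting outside the guard: looking up the crawl delay can lazily fetch robots.txt, so that lookup has to live behind the same flag that suppresses the robots check. A minimal sketch of the intended control flow, with purely illustrative names (this is not the actual Fetcher code):

```java
// Sketch of the NUTCH-753 idea: only consult the robots rules (which may
// trigger a robots.txt fetch) when robots checking is enabled; otherwise
// use a configured default delay so robots.txt is never fetched a second time.
public class RobotsGuardSketch {

    // Stand-in for the robots rules object; in the real code path,
    // getCrawlDelay() could cause robots.txt to be fetched.
    public static class Rules {
        public boolean fetched = false;
        public long getCrawlDelay() { fetched = true; return 500L; }
    }

    public static long crawlDelay(boolean checkRobots, Rules rules, long defaultDelay) {
        if (checkRobots) {
            return rules.getCrawlDelay(); // may fetch robots.txt
        }
        return defaultDelay;              // check disabled: no second fetch
    }
}
```

The point is simply that every code path touching the rules object sits inside the same conditional, so disabling the check really does prevent the extra fetch.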
[jira] Commented: (NUTCH-773) some minor bugs in AbstractFetchSchedule.java
[ https://issues.apache.org/jira/browse/NUTCH-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783238#action_12783238 ] Hudson commented on NUTCH-773: -- Integrated in Nutch-trunk #995 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/995/]) Some minor bugs in AbstractFetchSchedule.

some minor bugs in AbstractFetchSchedule.java - Key: NUTCH-773 URL: https://issues.apache.org/jira/browse/NUTCH-773 Project: Nutch Issue Type: Bug Components: fetcher, generator Affects Versions: 1.0.0 Reporter: Reinhard Schwab Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: NUTCH-773.patch

Fixes some minor trivial bugs in AbstractFetchSchedule.java.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-772) Upgrade Nutch to use Lucene 2.9.1
[ https://issues.apache.org/jira/browse/NUTCH-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783236#action_12783236 ] Hudson commented on NUTCH-772: -- Integrated in Nutch-trunk #995 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/995/]) Upgrade Nutch to use Lucene 2.9.1.

Upgrade Nutch to use Lucene 2.9.1 - Key: NUTCH-772 URL: https://issues.apache.org/jira/browse/NUTCH-772 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: lucene.patch

Upgrade Nutch to the latest Lucene release.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-760) Allow field mapping from nutch to solr index
[ https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783237#action_12783237 ] Hudson commented on NUTCH-760: -- Integrated in Nutch-trunk #995 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/995/]) Add part of . Allow field mapping from nutch to solr index.

Allow field mapping from nutch to solr index Key: NUTCH-760 URL: https://issues.apache.org/jira/browse/NUTCH-760 Project: Nutch Issue Type: Improvement Components: indexer Reporter: David Stuart Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: solrindex_schema.patch, solrindex_schema.patch, solrindex_schema.patch, solrindex_schema.patch

I am using nutch to crawl sites and have combined it with solr, pushing the nutch index using the solrindex command. I have set it up as specified on the wiki, using the copyField from url to id in the schema. Whilst this works fine, it stuffs up my inputs from other sources in solr (e.g. using the solr data import handler), as they have both ids and urls. I have a patch that implements a nutch xml schema defining what the basic nutch fields map to in your solr push.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
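A field-mapping file along the lines the patch describes might look like the sketch below. The element names, field names, and overall layout here are purely illustrative assumptions; the actual schema shipped with the patch may differ.

```xml
<!-- Hypothetical sketch of a nutch-to-solr field mapping: each entry says
     which Solr field a given Nutch field should be written to, so e.g. the
     Nutch "url" field no longer has to be copied into a Solr "id" field. -->
<mapping>
  <fields>
    <field source="url" dest="url"/>
    <field source="content" dest="text"/>
    <field source="title" dest="title"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
```

The benefit described in the issue is exactly this indirection: documents from other sources (like the data import handler) keep their own id/url conventions, and the crawler's fields are renamed at index time instead of via schema-level copyField tricks.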
[jira] Commented: (NUTCH-765) Allow Crawl class to call Either Solr or Lucene Indexer
[ https://issues.apache.org/jira/browse/NUTCH-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783234#action_12783234 ] Hudson commented on NUTCH-765: -- Integrated in Nutch-trunk #995 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/995/]) - Allow Crawl class to call Either Solr or Lucene Indexer.

Allow Crawl class to call Either Solr or Lucene Indexer --- Key: NUTCH-765 URL: https://issues.apache.org/jira/browse/NUTCH-765 Project: Nutch Issue Type: Improvement Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Fix For: 1.0.0, 1.1 Attachments: NUTCH-765-2009112-1.patch

Change to the crawl class to have a -solr option which will call the solr indexer instead of the lucene indexer. This also allows it to ignore dedup and merge for solr indexing and to point to a specific solr instance.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-761) Avoid cloning CrawlDatum in CrawlDbReducer
[ https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783239#action_12783239 ] Hudson commented on NUTCH-761: -- Integrated in Nutch-trunk #995 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/995/]) Fix a bug resulting from over-eager optimization in . Avoid cloning CrawlDatum in CrawlDbReducer.

Avoid cloning CrawlDatum in CrawlDbReducer -- Key: NUTCH-761 URL: https://issues.apache.org/jira/browse/NUTCH-761 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: optiCrawlReducer.patch

In the huge majority of cases the CrawlDbReducer gets a unique CrawlDatum in its reduce phase, and these will be the entries coming from the crawlDB and not present in the segments. The attached patch optimizes the reduce step by avoiding an unnecessary cloning of the CrawlDatum fields when there is only one CrawlDatum in the values. This has more impact as the crawlDB gets larger; we noticed an improvement of around 25-30% in the time spent in the reduce phase.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
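The optimization described above can be sketched as: defer any copying until a second value actually shows up, so the common single-value case pays no cloning cost. Types and names below are hypothetical stand-ins, not the real CrawlDbReducer:

```java
import java.util.Iterator;

// Sketch of the NUTCH-761 idea: in a reduce over CrawlDatum-like values,
// only copy when there is more than one value. With a single value (the
// common case: an entry from the crawlDB, absent from the segments) the
// value can be used directly, skipping the field-by-field clone.
public class SingleValueReduceSketch {

    public static class Datum {
        public final String state;
        public Datum(String state) { this.state = state; }
        public Datum copy() { return new Datum(state); } // stands in for cloning all fields
    }

    public static Datum reduce(Iterator<Datum> values) {
        Datum first = values.next();
        if (!values.hasNext()) {
            return first;              // single value: no copy needed
        }
        Datum result = first.copy();   // multiple values: copy before merging
        while (values.hasNext()) {
            values.next();             // merge logic elided in this sketch
        }
        return result;
    }
}
```

Since the lone-value branch dominates as the crawlDB grows, skipping the copy there is consistent with the 25-30% reduce-phase improvement reported.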
[jira] Updated: (NUTCH-770) Timebomb for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MilleBii updated NUTCH-770: --- Attachment: log-770

Please find attached the logs for the patch... I did try it, but I could not compile after applying it.

Timebomb for Fetcher Key: NUTCH-770 URL: https://issues.apache.org/jira/browse/NUTCH-770 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Attachments: log-770, NUTCH-770.patch

This patch provides the Fetcher with a timebomb mechanism. By default the timebomb is not activated; it can be set using the parameter fetcher.timebomb.mins. The number of minutes is relative to the start of the Fetch job. When the number of minutes is reached, the QueueFeeder skips all remaining entries, then all active queues are purged. This allows keeping the Fetch step under control and works well in combination with NUTCH-769.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions
[ https://issues.apache.org/jira/browse/NUTCH-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-769: Attachment: NUTCH-769-2.patch

Fetcher to skip queues for URLS getting repeated exceptions - Key: NUTCH-769 URL: https://issues.apache.org/jira/browse/NUTCH-769 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Julien Nioche Priority: Minor Attachments: NUTCH-769-2.patch, NUTCH-769.patch

As discussed on the mailing list (see http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg15360.html) this patch allows clearing URL queues in the Fetcher when more than a set number of exceptions have been encountered in a row. This can speed up the fetching substantially in cases where target hosts are not responsive (as a TimeoutException would be thrown) and limits cases where a whole Fetch step is slowed down because of a few queues. By default the parameter fetcher.max.exceptions.per.queue has a value of -1, i.e. the feature is deactivated.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions
[ https://issues.apache.org/jira/browse/NUTCH-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783247#action_12783247 ] Julien Nioche commented on NUTCH-769: - Missed a couple of lines indeed when I was trying to untangle this functionality from my (heavily modified) local copy. checkExceptionThreshold is called after line 664:

case ProtocolStatus.EXCEPTION:
  logError(fit.url, status.getMessage());
  int killedURLs = fetchQueues.checkExceptionThreshold(fit.getQueueID());
  reporter.incrCounter("FetcherStatus", "Exceptions", killedURLs);

I'll attach a modified version of the patch. Thanks J.

Fetcher to skip queues for URLS getting repeated exceptions - Key: NUTCH-769 URL: https://issues.apache.org/jira/browse/NUTCH-769 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Julien Nioche Priority: Minor Attachments: NUTCH-769-2.patch, NUTCH-769.patch

As discussed on the mailing list (see http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg15360.html) this patch allows clearing URL queues in the Fetcher when more than a set number of exceptions have been encountered in a row. This can speed up the fetching substantially in cases where target hosts are not responsive (as a TimeoutException would be thrown) and limits cases where a whole Fetch step is slowed down because of a few queues. By default the parameter fetcher.max.exceptions.per.queue has a value of -1, i.e. the feature is deactivated.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
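The per-queue mechanism behind checkExceptionThreshold can be sketched as a consecutive-exception counter that purges the host's queue once a configured limit is hit. Class and method names here are illustrative, not the actual FetchItemQueues code:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of the NUTCH-769 mechanism: count consecutive exceptions per host
// queue and drop the remaining queued URLs once a threshold is reached, so
// a dead host cannot stall the whole fetch.
public class ExceptionThresholdSketch {

    public final Queue<String> urls = new ArrayDeque<>();
    public int consecutiveExceptions = 0;

    // Mirrors fetcher.max.exceptions.per.queue: a negative value disables the feature.
    public final int maxExceptions;

    public ExceptionThresholdSketch(int maxExceptions) { this.maxExceptions = maxExceptions; }

    public void onSuccess() { consecutiveExceptions = 0; } // a success resets the streak

    // Returns how many queued URLs were dropped, like checkExceptionThreshold().
    public int onException() {
        consecutiveExceptions++;
        if (maxExceptions >= 0 && consecutiveExceptions >= maxExceptions) {
            int killed = urls.size();
            urls.clear();              // give up on this unresponsive host
            return killed;
        }
        return 0;
    }
}
```

Counting only *consecutive* exceptions (reset on any success) matches the "encountered in a row" wording in the issue, so a flaky-but-alive host is not purged.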
[jira] Commented: (NUTCH-770) Timebomb for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783248#action_12783248 ] Julien Nioche commented on NUTCH-770: - The log simply shows that the patch has not been applied properly. See http://markmail.org/message/wbd3r3t5bfxzkbpn for a discussion on how to apply patches. Should work fine from the root directory of Nutch with: patch -p0 < ~/Desktop/NUTCH-770.patch

Timebomb for Fetcher Key: NUTCH-770 URL: https://issues.apache.org/jira/browse/NUTCH-770 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Attachments: log-770, NUTCH-770.patch

This patch provides the Fetcher with a timebomb mechanism. By default the timebomb is not activated; it can be set using the parameter fetcher.timebomb.mins. The number of minutes is relative to the start of the Fetch job. When the number of minutes is reached, the QueueFeeder skips all remaining entries, then all active queues are purged. This allows keeping the Fetch step under control and works well in combination with NUTCH-769.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-770) Timebomb for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783252#action_12783252 ] MilleBii commented on NUTCH-770: That's what I did, and I just retried... so I'm a bit surprised too. Other patches have worked fine so far. ???

Timebomb for Fetcher Key: NUTCH-770 URL: https://issues.apache.org/jira/browse/NUTCH-770 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Attachments: log-770, NUTCH-770.patch

This patch provides the Fetcher with a timebomb mechanism. By default the timebomb is not activated; it can be set using the parameter fetcher.timebomb.mins. The number of minutes is relative to the start of the Fetch job. When the number of minutes is reached, the QueueFeeder skips all remaining entries, then all active queues are purged. This allows keeping the Fetch step under control and works well in combination with NUTCH-769.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-770) Timebomb for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783283#action_12783283 ] Andrzej Bialecki commented on NUTCH-770: - I propose to change the name of this functionality - timebomb is not self-explanatory, and it suggests that if you misbehave then your cluster may explode ;) Instead I would use time limit, rename all vars and methods to follow this naming, and document it properly in nutch-default.xml. A few comments on the patch:
* it has some overlap with NUTCH-769 (the emptyQueue() method), but that's easy to resolve, see also the next point.
* why change the code in FetchQueues at all? The time limit is a global condition; we could just break the main loop in run() and ignore the QueueFeeder (or not start it at all if the time limit has already passed when run() starts).
* the patch does not follow the code style (notably whitespace in for/while loops and assignments).

Timebomb for Fetcher Key: NUTCH-770 URL: https://issues.apache.org/jira/browse/NUTCH-770 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Attachments: log-770, NUTCH-770.patch

This patch provides the Fetcher with a timebomb mechanism. By default the timebomb is not activated; it can be set using the parameter fetcher.timebomb.mins. The number of minutes is relative to the start of the Fetch job. When the number of minutes is reached, the QueueFeeder skips all remaining entries, then all active queues are purged. This allows keeping the Fetch step under control and works well in combination with NUTCH-769.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
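The time-limit variant proposed in the comment above — a single global deadline checked by the main loop, rather than per-queue purging — can be sketched as follows. The parameter name fetcher.timelimit.mins and all class names are assumptions for illustration:

```java
// Sketch of a global fetch time limit (the renamed "timebomb"): compute an
// absolute deadline once when the job starts; the main run() loop and the
// QueueFeeder simply stop when the deadline has passed. A negative number
// of minutes means no limit, matching the "-1 disables it" convention.
public class TimeLimitSketch {

    public final long deadline; // absolute time in ms, or -1 for "no limit"

    public TimeLimitSketch(long limitMins) {
        this.deadline = limitMins < 0
            ? -1
            : System.currentTimeMillis() + limitMins * 60_000L;
    }

    // The fetcher's main loop would call this each iteration and break out
    // (instead of mutating the queues) once it returns true.
    public boolean expired() {
        return deadline > 0 && System.currentTimeMillis() >= deadline;
    }
}
```

Keeping the limit as one global condition, as the comment suggests, avoids touching FetchQueues at all: expiry just breaks the loop and lets the normal shutdown path drain the remaining work.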
[jira] Closed: (NUTCH-746) NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container.
[ https://issues.apache.org/jira/browse/NUTCH-746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-746. --- Resolution: Fixed Assignee: Andrzej Bialecki

NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container. Key: NUTCH-746 URL: https://issues.apache.org/jira/browse/NUTCH-746 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Environment: Apache Tomcat 5.5.27 and 6.0.18, Fedora 11, OpenJDK or Sun JDK 1.6 OpenJDK 64-Bit Server VM (build 14.0-b15, mixed mode) Reporter: Kirby Bohling Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-746.patch

NutchBeanConstructor is not cleaning up upon application shutdown (contextDestroyed()). It leaves open the SegmentUpdater, and potentially other resources. This means the WebApp's classloader cannot be GC'ed in Tomcat, which after repeated restarts will lead to a PermGen error.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-746) NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container.
[ https://issues.apache.org/jira/browse/NUTCH-746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783287#action_12783287 ] Andrzej Bialecki commented on NUTCH-746: - Fixed in rev. 885148. Thanks!

NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container. Key: NUTCH-746 URL: https://issues.apache.org/jira/browse/NUTCH-746 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Environment: Apache Tomcat 5.5.27 and 6.0.18, Fedora 11, OpenJDK or Sun JDK 1.6 OpenJDK 64-Bit Server VM (build 14.0-b15, mixed mode) Reporter: Kirby Bohling Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-746.patch

NutchBeanConstructor is not cleaning up upon application shutdown (contextDestroyed()). It leaves open the SegmentUpdater, and potentially other resources. This means the WebApp's classloader cannot be GC'ed in Tomcat, which after repeated restarts will lead to a PermGen error.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-738) Close SegmentUpdater when FetchedSegments is closed
[ https://issues.apache.org/jira/browse/NUTCH-738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-738. --- Resolution: Fixed Assignee: Andrzej Bialecki Close SegmentUpdater when FetchedSegments is closed --- Key: NUTCH-738 URL: https://issues.apache.org/jira/browse/NUTCH-738 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 1.0.0 Reporter: Martina Koch Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: FetchedSegments.patch, NUTCH-738.patch Currently FetchedSegments starts a SegmentUpdater, but never closes it when FetchedSegments is closed. (The problem was described in this mailing: http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg13823.html) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-739. --- Resolution: Fixed Assignee: Andrzej Bialecki

SolrDeleteDuplications too slow when using hadoop - Key: NUTCH-739 URL: https://issues.apache.org/jira/browse/NUTCH-739 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.0.0 Environment: hadoop cluster with 3 nodes Map Task Capacity: 6 Reduce Task Capacity: 6 Indexer: one instance of solr server (on one of the slave nodes) Reporter: Dmitry Lihachev Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-739_remove_optimize_on_solr_dedup.patch

In my environment I always have many warnings like this on the dedup step:

{noformat} Task attempt_200905270022_0212_r_03_0 failed to report status for 600 seconds. Killing! {noformat}

solr logs:

{noformat} INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173741 May 27, 2009 10:29:27 AM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: {optimize=} 0 173599 May 27, 2009 10:29:27 AM org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173599 May 27, 2009 10:29:27 AM org.apache.solr.search.SolrIndexSearcher close INFO: Closing searc...@2ad9ac58 main May 27, 2009 10:29:27 AM org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean getMBeanInfo WARNING: Could not getStatistics on info bean org.apache.solr.search.SolrIndexSearcher org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed {noformat}

So I think the problem is in the piece of code on line 301 of SolrDeleteDuplications (solr.optimize()), because we have a few job tasks, each of which tries to optimize the solr indexes before closing.
The simplest way to avoid this bug is removing this line and sending an <optimize/> message directly to the solr server after the dedup step.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
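The fix described above — dropping the per-task solr.optimize() call and issuing a single optimize after the whole dedup job — can be sketched like this. The client interface here is a hypothetical stand-in, not the SolrJ API:

```java
// Sketch of the NUTCH-739 fix: N reduce tasks each ending with
// solr.optimize() means N expensive full-index optimizes (and task
// timeouts while Solr grinds); instead, each task only commits, and the
// driver sends exactly one optimize once all tasks have finished.
public class DedupOptimizeSketch {

    // Hypothetical minimal client surface for the sketch.
    public interface SolrClientish {
        void commit();
        void optimize();
    }

    public static void runDedup(SolrClientish solr, int reduceTasks) {
        for (int task = 0; task < reduceTasks; task++) {
            solr.commit();         // per-task: cheap commit only, no optimize
        }
        solr.optimize();           // once, after the whole dedup step
    }
}
```

This matches the log evidence in the issue: each optimize took roughly 173 seconds, so six reduce tasks racing to optimize the same index easily exceeds the 600-second task-status timeout.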
[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783290#action_12783290 ] Andrzej Bialecki commented on NUTCH-739: - Fixed in rev. 885152. Thank you!

SolrDeleteDuplications too slow when using hadoop - Key: NUTCH-739 URL: https://issues.apache.org/jira/browse/NUTCH-739 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.0.0 Environment: hadoop cluster with 3 nodes Map Task Capacity: 6 Reduce Task Capacity: 6 Indexer: one instance of solr server (on one of the slave nodes) Reporter: Dmitry Lihachev Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-739_remove_optimize_on_solr_dedup.patch

In my environment I always have many warnings like this on the dedup step:

{noformat} Task attempt_200905270022_0212_r_03_0 failed to report status for 600 seconds. Killing! {noformat}

solr logs:

{noformat} INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173741 May 27, 2009 10:29:27 AM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: {optimize=} 0 173599 May 27, 2009 10:29:27 AM org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173599 May 27, 2009 10:29:27 AM org.apache.solr.search.SolrIndexSearcher close INFO: Closing searc...@2ad9ac58 main May 27, 2009 10:29:27 AM org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean getMBeanInfo WARNING: Could not getStatistics on info bean org.apache.solr.search.SolrIndexSearcher org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed {noformat}

So I think the problem is in the piece of code on line 301 of SolrDeleteDuplications (solr.optimize()), because we have a few job tasks, each of which tries to optimize the solr indexes before closing.
The simplest way to avoid this bug is removing this line and sending an <optimize/> message directly to the solr server after the dedup step.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-755) DomainURLFilter crashes on malformed URL
[ https://issues.apache.org/jira/browse/NUTCH-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-755. --- Resolution: Cannot Reproduce Assignee: Andrzej Bialecki DomainURLFilter crashes on malformed URL Key: NUTCH-755 URL: https://issues.apache.org/jira/browse/NUTCH-755 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: Tomcat 6.0.14 Java 1.6.0_14 Linux Reporter: Mike Baranczak Assignee: Andrzej Bialecki 2009-09-16 21:54:17,001 ERROR [Thread-156] DomainURLFilter - Could not apply filter on url: http:/comments.php java.lang.NullPointerException at org.apache.nutch.urlfilter.domain.DomainURLFilter.filter(DomainURLFilter.java:173) at org.apache.nutch.net.URLFilters.filter(URLFilters.java:88) at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:200) at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:113) at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:96) at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:70) at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410) at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170) Expected behavior would be to recognize the URL as malformed, and reject it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-755) DomainURLFilter crashes on malformed URL
[ https://issues.apache.org/jira/browse/NUTCH-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783299#action_12783299 ] Andrzej Bialecki commented on NUTCH-755: - I could not verify that the filter indeed crashes - it simply prints the exception and then returns null, as you suggested.

DomainURLFilter crashes on malformed URL Key: NUTCH-755 URL: https://issues.apache.org/jira/browse/NUTCH-755 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: Tomcat 6.0.14 Java 1.6.0_14 Linux Reporter: Mike Baranczak Assignee: Andrzej Bialecki

2009-09-16 21:54:17,001 ERROR [Thread-156] DomainURLFilter - Could not apply filter on url: http:/comments.php java.lang.NullPointerException at org.apache.nutch.urlfilter.domain.DomainURLFilter.filter(DomainURLFilter.java:173) at org.apache.nutch.net.URLFilters.filter(URLFilters.java:88) at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:200) at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:113) at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:96) at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:70) at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410) at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)

Expected behavior would be to recognize the URL as malformed, and reject it.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783302#action_12783302 ] Andrzej Bialecki commented on NUTCH-692: - We should review this issue after the upgrade to Hadoop 0.20 - task output mgmt differs there, and the problem may be nonexistent.

AlreadyBeingCreatedException with Hadoop 0.19 - Key: NUTCH-692 URL: https://issues.apache.org/jira/browse/NUTCH-692 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Julien Nioche Attachments: NUTCH-692.patch

I have been using the SVN version of Nutch on an EC2 cluster and got some AlreadyBeingCreatedException during the reduce phase of a parse. For some reason one of my tasks crashed and then I ran into this AlreadyBeingCreatedException when other nodes tried to pick it up. There was recently a discussion on the Hadoop user list on similar issues with Hadoop 0.19 (see http://markmail.org/search/after+upgrade+to+0%2E19%2E0). I have not tried using 0.18.2 yet but will do if the problems persist with 0.19. I was wondering whether anyone else had experienced the same problem. Do you think 0.19 is stable enough to use it for Nutch 1.0? I will be running a crawl on a super large cluster in the next couple of weeks and I will confirm this issue J.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-741) Job file includes multiple copies of nutch config files.
[ https://issues.apache.org/jira/browse/NUTCH-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783304#action_12783304 ] Andrzej Bialecki commented on NUTCH-741: - Fixed in rev. 885156. Thank you!

Job file includes multiple copies of nutch config files. Key: NUTCH-741 URL: https://issues.apache.org/jira/browse/NUTCH-741 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Kirby Bohling Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: removeJobDupConf.diff

From a clean checkout, running ant tar will create a .job file. The .job file includes two copies of the nutch-site.xml and nutch-default.xml file.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-741) Job file includes multiple copies of nutch config files.
[ https://issues.apache.org/jira/browse/NUTCH-741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-741. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki Job file includes multiple copies of nutch config files. Key: NUTCH-741 URL: https://issues.apache.org/jira/browse/NUTCH-741 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Kirby Bohling Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: removeJobDupConf.diff From a clean checkout, running ant tar will create a .job file. The .job file includes two copies of the nutch-site.xml and nutch-default.xml file. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-712) ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers
[ https://issues.apache.org/jira/browse/NUTCH-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-712. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers - Key: NUTCH-712 URL: https://issues.apache.org/jira/browse/NUTCH-712 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: ParseOutputFormat-NUTCH712v2.patch ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers otherwise the whole parsing step crashes instead of simply ignoring dodgy outlinks -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
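The behaviour the issue asks for — one malformed outlink is skipped rather than crashing the whole parse output step — amounts to wrapping the per-outlink normalization in a try/catch. A minimal self-contained sketch (normalize() here is a trivial stand-in for a real URL normalizer):

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Sketch of the NUTCH-712 behaviour: normalize each outlink inside a
// try/catch so a dodgy URL is dropped instead of aborting the parse step.
public class OutlinkFilterSketch {

    // Stand-in normalizer: throws MalformedURLException on bad input,
    // just like a normalizer backed by java.net.URL would.
    public static String normalize(String url) throws MalformedURLException {
        return new URL(url).toString();
    }

    public static List<String> keepValid(List<String> outlinks) {
        List<String> ok = new ArrayList<>();
        for (String link : outlinks) {
            try {
                ok.add(normalize(link));
            } catch (MalformedURLException e) {
                // dodgy outlink: ignore and carry on with the rest
            }
        }
        return ok;
    }
}
```

The key design point is the scope of the catch: it surrounds a single outlink, so one bad link costs only that link, never the document's other outlinks or the job.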
[jira] Commented: (NUTCH-712) ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers
[ https://issues.apache.org/jira/browse/NUTCH-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783306#action_12783306 ] Andrzej Bialecki commented on NUTCH-712: - Fixed in rev. 885159. Thank you!

ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers - Key: NUTCH-712 URL: https://issues.apache.org/jira/browse/NUTCH-712 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: ParseOutputFormat-NUTCH712v2.patch

ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers, otherwise the whole parsing step crashes instead of simply ignoring dodgy outlinks.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[Nutch Wiki] Trivial Update of Automating_Fetches_with_Python by newacct
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification. The Automating_Fetches_with_Python page has been changed by newacct.
http://wiki.apache.org/nutch/Automating_Fetches_with_Python?action=diff&rev1=5&rev2=6

--

  import sys
  import getopt
  import re
- import string
  import logging
  import logging.config
  import commands
@@ -259, +258 @@
          total_urls += 1
      urllinecount.close()
      numsplits = total_urls / splitsize
-     padding = "0" * len(`numsplits`)
+     padding = "0" * len(repr(numsplits))
      # create the url load folder
-     linenum = 0
      filenum = 0
-     strfilenum = `filenum`
+     strfilenum = repr(filenum)
      urloutdir = outdir + "/urls-" + padding[len(strfilenum):] + strfilenum
      os.mkdir(urloutdir)
      urlfile = urloutdir + "/urls"
@@ -275, +273 @@
      outhandle = open(urlfile, "w")
      # loop through the file
-     for line in inhandle:
+     for linenum, line in enumerate(inhandle):
          # if we have come to a split then close the current file, create a new
          # url folder and open a new url file
-         if linenum > 0 and (linenum % splitsize == 0):
+         if linenum > 0 and linenum % splitsize == 0:
-             filenum = filenum + 1
+             filenum += 1
-             strfilenum = `filenum`
+             strfilenum = repr(filenum)
              urloutdir = outdir + "/urls-" + padding[len(strfilenum):] + strfilenum
              os.mkdir(urloutdir)
              urlfile = urloutdir + "/urls"
@@ -290, +288 @@
              outhandle.close()
              outhandle = open(urlfile, "w")
-         # write the url to the file and increase the number of lines read
+         # write the url to the file
          outhandle.write(line)
-         linenum = linenum + 1
      # close the input and output files
      inhandle.close()
@@ -362, +359 @@
          # fetch the current segment
          outar = result[1].splitlines()
-         output = outar[len(outar) - 1]
+         output = outar[-1]
-         tempseg = string.split(output)[0]
+         tempseg = output.split()[0]
          tempseglist.append(tempseg)
          fetch = self.nutchdir + "/bin/nutch fetch " + tempseg
          self.log.info("Starting fetch for: " + tempseg)
@@ -392, +389 @@
          # merge the crawldbs
          self.log.info("Merging master and temp crawldbs.")
-         crawlmerge = (self.nutchdir + "/bin/nutch mergedb mergetemp/crawldb " +
+         crawlmerge = self.nutchdir + "/bin/nutch mergedb mergetemp/crawldb " + \
-                      mastercrawldbdir + " " + string.join(tempdblist, " "))
+                      mastercrawldbdir + " " + " ".join(tempdblist)
          self.log.info("Running: " + crawlmerge)
          result = commands.getstatusoutput(crawlmerge)
          self.checkStatus(result, "Error occurred while running command " + crawlmerge)
@@ -404, +401 @@
          result = commands.getstatusoutput(getsegment)
          self.checkStatus(result, "Error occurred while running command " + getsegment)
          outar = result[1].splitlines()
-         output = outar[len(outar) - 1]
+         output = outar[-1]
-         masterseg = string.split(output)[0]
+         masterseg = output.split()[0]
-         mergesegs = (self.nutchdir + "/bin/nutch mergesegs mergetemp/segments " +
+         mergesegs = self.nutchdir + "/bin/nutch mergesegs mergetemp/segments " + \
-                     masterseg + " " + string.join(tempseglist, " "))
+                     masterseg + " " + " ".join(tempseglist)
          self.log.info("Running: " + mergesegs)
          result = commands.getstatusoutput(mergesegs)
          self.checkStatus(result, "Error occurred while running command " + mergesegs)
@@ -464, +461 @@
      usage.append("[-b | --backupdir] The master backup directory, [crawl-backup].\n")
      usage.append("[-s | --splitsize] The number of urls per load [50].\n")
      usage.append("[-f | --fetchmerge] The number of fetches to run before merging [1].\n")
-     message = string.join(usage)
+     message = " ".join(usage)
      print message
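The hunks above replace deprecated Python idioms (backticks, the `string` module, a hand-maintained line counter) with their modern equivalents. A small self-contained sketch of those replacements, written in current Python rather than the Python 2 dialect of the wiki script, with a hypothetical `split_urls` helper standing in for the script's file-split loop:

```python
# Demonstrates the replacements made in the diff: backticks -> repr(),
# a manual line counter -> enumerate(), string.join/split -> str methods.
# split_urls is an illustrative stand-in, not a function from the wiki script.

def split_urls(lines, splitsize):
    """Group url lines into chunks of splitsize, as the script's split loop does."""
    chunks = [[]]
    for linenum, line in enumerate(lines):        # replaces the hand-kept counter
        if linenum > 0 and linenum % splitsize == 0:
            chunks.append([])                     # start a new "urls-NN" chunk
        chunks[-1].append(line)
    return chunks

numsplits = 12
padding = "0" * len(repr(numsplits))              # repr() replaces `numsplits` backticks
strfilenum = repr(0)
print(padding[len(strfilenum):] + strfilenum)     # zero-padded folder suffix: "00"

print(" ".join(["a", "b"]))                       # str.join replaces string.join(list, " ")
print("crawl/segments/123  done".split()[0])      # str.split replaces string.split(output)
```

The `enumerate` version also lets the diff drop the `linenum = linenum + 1` bookkeeping line entirely.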
[jira] Commented: (NUTCH-738) Close SegmentUpdater when FetchedSegments is closed
[ https://issues.apache.org/jira/browse/NUTCH-738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783359#action_12783359 ]

Hudson commented on NUTCH-738:
------------------------------

Integrated in Nutch-trunk #996 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/996/])
Close SegmentUpdater when FetchedSegments is closed.

Close SegmentUpdater when FetchedSegments is closed
---------------------------------------------------

                Key: NUTCH-738
                URL: https://issues.apache.org/jira/browse/NUTCH-738
            Project: Nutch
         Issue Type: Improvement
         Components: searcher
   Affects Versions: 1.0.0
           Reporter: Martina Koch
           Assignee: Andrzej Bialecki
           Priority: Minor
            Fix For: 1.1
        Attachments: FetchedSegments.patch, NUTCH-738.patch

Currently FetchedSegments starts a SegmentUpdater, but never closes it when FetchedSegments is closed. (The problem was described in this mailing: http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg13823.html)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-741) Job file includes multiple copies of nutch config files.
[ https://issues.apache.org/jira/browse/NUTCH-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783357#action_12783357 ]

Hudson commented on NUTCH-741:
------------------------------

Integrated in Nutch-trunk #996 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/996/])
Job file includes multiple copies of nutch config files.

Job file includes multiple copies of nutch config files.
--------------------------------------------------------

                Key: NUTCH-741
                URL: https://issues.apache.org/jira/browse/NUTCH-741
            Project: Nutch
         Issue Type: Bug
         Components: build
   Affects Versions: 1.0.0
           Reporter: Kirby Bohling
           Assignee: Andrzej Bialecki
           Priority: Minor
            Fix For: 1.1
        Attachments: removeJobDupConf.diff

From a clean checkout, running "ant tar" will create a .job file. The .job file includes two copies of the nutch-site.xml and nutch-default.xml files.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-712) ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers
[ https://issues.apache.org/jira/browse/NUTCH-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783360#action_12783360 ]

Hudson commented on NUTCH-712:
------------------------------

Integrated in Nutch-trunk #996 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/996/])
ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers.

ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers
-------------------------------------------------------------------------------------

                Key: NUTCH-712
                URL: https://issues.apache.org/jira/browse/NUTCH-712
            Project: Nutch
         Issue Type: Improvement
   Affects Versions: 1.0.0
           Reporter: Julien Nioche
           Assignee: Andrzej Bialecki
            Fix For: 1.1
        Attachments: ParseOutputFormat-NUTCH712v2.patch

ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers; otherwise the whole parsing step crashes instead of simply ignoring dodgy outlinks.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
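The underlying pattern is catching a per-item exception inside the loop so one bad outlink is skipped rather than aborting the whole parse. A sketch in Python (not the Java code from the patch); `normalize` here is a hypothetical stand-in for Nutch's URL-normalizer chain:

```python
# Skip malformed outlinks instead of letting one of them crash the parse step.
# normalize() is an illustrative stand-in for the real normalizer chain.
from urllib.parse import urlparse

def normalize(url):
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError("malformed URL: " + url)   # plays the role of MalformedURLException
    return parsed.geturl().lower()

def normalize_outlinks(outlinks):
    kept = []
    for url in outlinks:
        try:
            kept.append(normalize(url))
        except ValueError:   # catch per outlink, inside the loop
            continue         # ignore the dodgy outlink, keep processing the rest
    return kept

print(normalize_outlinks(["http://Example.com/A", "not a url"]))
```

The key point is the placement of the try/except: around each outlink, not around the whole loop, so good outlinks after a bad one are still emitted.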
[jira] Commented: (NUTCH-746) NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container.
[ https://issues.apache.org/jira/browse/NUTCH-746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783356#action_12783356 ]

Hudson commented on NUTCH-746:
------------------------------

Integrated in Nutch-trunk #996 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/996/])
NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container.

NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container.
------------------------------------------------------------------------------------------------------------

                Key: NUTCH-746
                URL: https://issues.apache.org/jira/browse/NUTCH-746
            Project: Nutch
         Issue Type: Bug
   Affects Versions: 1.0.0
        Environment: Apache Tomcat 5.5.27 and 6.0.18, Fedora 11, OpenJDK or Sun JDK 1.6, OpenJDK 64-Bit Server VM (build 14.0-b15, mixed mode)
           Reporter: Kirby Bohling
           Assignee: Andrzej Bialecki
            Fix For: 1.1
        Attachments: NUTCH-746.patch

NutchBeanConstructor is not cleaning up upon application shutdown (contextDestroyed()). It leaves open the SegmentUpdater, and potentially other resources. This prevents the WebApp's classloader from being GC'ed in Tomcat, which after repeated restarts will lead to a PermGen error.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783358#action_12783358 ]

Hudson commented on NUTCH-739:
------------------------------

Integrated in Nutch-trunk #996 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/996/])
SolrDeleteDuplications too slow when using hadoop.

SolrDeleteDuplications too slow when using hadoop
-------------------------------------------------

                Key: NUTCH-739
                URL: https://issues.apache.org/jira/browse/NUTCH-739
            Project: Nutch
         Issue Type: Bug
         Components: indexer
   Affects Versions: 1.0.0
        Environment: hadoop cluster with 3 nodes; Map Task Capacity: 6; Reduce Task Capacity: 6; Indexer: one instance of solr server (on one of the slave nodes)
           Reporter: Dmitry Lihachev
           Assignee: Andrzej Bialecki
            Fix For: 1.1
        Attachments: NUTCH-739_remove_optimize_on_solr_dedup.patch

In my environment I always have many warnings like this on the dedup step:

{noformat}
Task attempt_200905270022_0212_r_03_0 failed to report status for 600 seconds. Killing!
{noformat}

solr logs:

{noformat}
INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173741
May 27, 2009 10:29:27 AM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {optimize=} 0 173599
May 27, 2009 10:29:27 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173599
May 27, 2009 10:29:27 AM org.apache.solr.search.SolrIndexSearcher close
INFO: Closing searc...@2ad9ac58 main
May 27, 2009 10:29:27 AM org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean getMBeanInfo
WARNING: Could not getStatistics on info bean org.apache.solr.search.SolrIndexSearcher
org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
{noformat}

So I think the problem is in the piece of code on line 301 of SolrDeleteDuplications (solr.optimize()). Because we have several reduce tasks, each of them tries to optimize the Solr index before closing. The simplest way to avoid this bug is to remove this line and send an <optimize/> message directly to the Solr server after the dedup step.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
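The suggested fix amounts to one optimize call for the whole job instead of one per reduce task. A sketch in Python of building that single post-dedup request; `build_optimize_request` and the host name are hypothetical, while the /update path and the <optimize/> XML message are standard Solr update-handler conventions:

```python
# One <optimize/> message sent once after the dedup job, replacing the
# per-task solr.optimize() calls. build_optimize_request is a hypothetical
# helper, not Nutch code; the URL here is a placeholder.

def build_optimize_request(solr_url):
    """Return (url, body, headers) for a single post-dedup optimize call."""
    return (solr_url.rstrip("/") + "/update",   # Solr's XML update handler
            "<optimize/>",                      # standard Solr optimize message
            {"Content-Type": "text/xml"})

url, body, headers = build_optimize_request("http://solr-host:8983/solr/")
print(url)   # http://solr-host:8983/solr/update
print(body)  # <optimize/>
# The actual HTTP POST could then be issued once, e.g. with
# urllib.request.urlopen(urllib.request.Request(url, body.encode(), headers))
```

Since optimize rewrites the entire index, running it once at the end also avoids the long QTime values seen in the logs piling up across tasks.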