[jira] Commented: (NUTCH-753) Prevent new Fetcher to retrieve the robots twice
[ https://issues.apache.org/jira/browse/NUTCH-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783235#action_12783235 ] Hudson commented on NUTCH-753: -- Integrated in Nutch-trunk #995 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/995/]) Prevent new Fetcher from retrieving the robots twice.

Prevent new Fetcher to retrieve the robots twice Key: NUTCH-753 URL: https://issues.apache.org/jira/browse/NUTCH-753 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-753.patch

The new Fetcher, which is now used by default, handles the robots file directly instead of relying on the protocol. The options Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS are set to false to prevent fetching the robots.txt twice (in the Fetcher and in the protocol), which avoids calling robots.isAllowed. However, in practice the robots file is still fetched, as there is a call to robots.getCrawlDelay() a bit further on which is not covered by the if (Protocol.CHECK_ROBOTS) check.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
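The report above boils down to one call sitting outside the guard: looking up the crawl delay can lazily fetch robots.txt, so that lookup has to live behind the same flag that suppresses the robots check. A minimal sketch of the intended control flow, with purely illustrative names (this is not the actual Fetcher code):

```java
// Sketch of the NUTCH-753 idea: only consult the robots rules (which may
// trigger a robots.txt fetch) when robots checking is enabled; otherwise
// use a configured default delay so robots.txt is never fetched a second time.
public class RobotsGuardSketch {

    // Stand-in for the robots rules object; in the real code path,
    // getCrawlDelay() could cause robots.txt to be fetched.
    public static class Rules {
        public boolean fetched = false;
        public long getCrawlDelay() { fetched = true; return 500L; }
    }

    public static long crawlDelay(boolean checkRobots, Rules rules, long defaultDelay) {
        if (checkRobots) {
            return rules.getCrawlDelay(); // may fetch robots.txt
        }
        return defaultDelay;              // check disabled: no second fetch
    }
}
```

The point is simply that every code path touching the rules object sits inside the same conditional, so disabling the check really does prevent the extra fetch.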
[jira] Commented: (NUTCH-773) some minor bugs in AbstractFetchSchedule.java
[ https://issues.apache.org/jira/browse/NUTCH-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783238#action_12783238 ] Hudson commented on NUTCH-773: -- Integrated in Nutch-trunk #995 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/995/]) Some minor bugs in AbstractFetchSchedule.

some minor bugs in AbstractFetchSchedule.java - Key: NUTCH-773 URL: https://issues.apache.org/jira/browse/NUTCH-773 Project: Nutch Issue Type: Bug Components: fetcher, generator Affects Versions: 1.0.0 Reporter: Reinhard Schwab Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: NUTCH-773.patch

Fixes some minor trivial bugs in AbstractFetchSchedule.java.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-772) Upgrade Nutch to use Lucene 2.9.1
[ https://issues.apache.org/jira/browse/NUTCH-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783236#action_12783236 ] Hudson commented on NUTCH-772: -- Integrated in Nutch-trunk #995 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/995/]) Upgrade Nutch to use Lucene 2.9.1.

Upgrade Nutch to use Lucene 2.9.1 - Key: NUTCH-772 URL: https://issues.apache.org/jira/browse/NUTCH-772 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: lucene.patch

Upgrade Nutch to the latest Lucene release.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-760) Allow field mapping from nutch to solr index
[ https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783237#action_12783237 ] Hudson commented on NUTCH-760: -- Integrated in Nutch-trunk #995 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/995/]) Add part of . Allow field mapping from nutch to solr index.

Allow field mapping from nutch to solr index Key: NUTCH-760 URL: https://issues.apache.org/jira/browse/NUTCH-760 Project: Nutch Issue Type: Improvement Components: indexer Reporter: David Stuart Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: solrindex_schema.patch, solrindex_schema.patch, solrindex_schema.patch, solrindex_schema.patch

I am using nutch to crawl sites and have combined it with solr, pushing the nutch index using the solrindex command. I have set it up as specified on the wiki, using the copyField from url to id in the schema. Whilst this works fine, it stuffs up my inputs from other sources in solr (e.g. using the solr data import handler), as they have both ids and urls. I have a patch that implements a nutch xml schema defining what the basic nutch fields map to in your solr push.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
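A field-mapping file along the lines the patch describes might look like the sketch below. The element names, field names, and overall layout here are purely illustrative assumptions; the actual schema shipped with the patch may differ.

```xml
<!-- Hypothetical sketch of a nutch-to-solr field mapping: each entry says
     which Solr field a given Nutch field should be written to, so e.g. the
     Nutch "url" field no longer has to be copied into a Solr "id" field. -->
<mapping>
  <fields>
    <field source="url" dest="url"/>
    <field source="content" dest="text"/>
    <field source="title" dest="title"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
```

The benefit described in the issue is exactly this indirection: documents from other sources (like the data import handler) keep their own id/url conventions, and the crawler's fields are renamed at index time instead of via schema-level copyField tricks.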
[jira] Commented: (NUTCH-765) Allow Crawl class to call Either Solr or Lucene Indexer
[ https://issues.apache.org/jira/browse/NUTCH-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783234#action_12783234 ] Hudson commented on NUTCH-765: -- Integrated in Nutch-trunk #995 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/995/]) - Allow Crawl class to call Either Solr or Lucene Indexer.

Allow Crawl class to call Either Solr or Lucene Indexer --- Key: NUTCH-765 URL: https://issues.apache.org/jira/browse/NUTCH-765 Project: Nutch Issue Type: Improvement Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Fix For: 1.0.0, 1.1 Attachments: NUTCH-765-2009112-1.patch

Change to the crawl class to have a -solr option which will call the solr indexer instead of the lucene indexer. This also allows it to ignore dedup and merge for solr indexing and to point to a specific solr instance.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-761) Avoid cloning CrawlDatum in CrawlDbReducer
[ https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783239#action_12783239 ] Hudson commented on NUTCH-761: -- Integrated in Nutch-trunk #995 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/995/]) Fix a bug resulting from over-eager optimization in . Avoid cloning CrawlDatum in CrawlDbReducer.

Avoid cloning CrawlDatum in CrawlDbReducer -- Key: NUTCH-761 URL: https://issues.apache.org/jira/browse/NUTCH-761 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: optiCrawlReducer.patch

In the huge majority of cases the CrawlDbReducer gets a unique CrawlDatum in its reduce phase, and these will be the entries coming from the crawlDB and not present in the segments. The attached patch optimizes the reduce step by avoiding an unnecessary cloning of the CrawlDatum fields when there is only one CrawlDatum in the values. This has more impact as the crawlDB gets larger; we noticed an improvement of around 25-30% in the time spent in the reduce phase.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
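The optimization described above can be sketched as: defer any copying until a second value actually shows up, so the common single-value case pays no cloning cost. Types and names below are hypothetical stand-ins, not the real CrawlDbReducer:

```java
import java.util.Iterator;

// Sketch of the NUTCH-761 idea: in a reduce over CrawlDatum-like values,
// only copy when there is more than one value. With a single value (the
// common case: an entry from the crawlDB, absent from the segments) the
// value can be used directly, skipping the field-by-field clone.
public class SingleValueReduceSketch {

    public static class Datum {
        public final String state;
        public Datum(String state) { this.state = state; }
        public Datum copy() { return new Datum(state); } // stands in for cloning all fields
    }

    public static Datum reduce(Iterator<Datum> values) {
        Datum first = values.next();
        if (!values.hasNext()) {
            return first;              // single value: no copy needed
        }
        Datum result = first.copy();   // multiple values: copy before merging
        while (values.hasNext()) {
            values.next();             // merge logic elided in this sketch
        }
        return result;
    }
}
```

Since the lone-value branch dominates as the crawlDB grows, skipping the copy there is consistent with the 25-30% reduce-phase improvement reported.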
[jira] Updated: (NUTCH-770) Timebomb for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MilleBii updated NUTCH-770: --- Attachment: log-770

Please find attached the logs for the patch... I did try it, but I could not compile after applying it.

Timebomb for Fetcher Key: NUTCH-770 URL: https://issues.apache.org/jira/browse/NUTCH-770 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Attachments: log-770, NUTCH-770.patch

This patch provides the Fetcher with a timebomb mechanism. By default the timebomb is not activated; it can be set using the parameter fetcher.timebomb.mins. The number of minutes is relative to the start of the Fetch job. When the number of minutes is reached, the QueueFeeder skips all remaining entries, then all active queues are purged. This allows keeping the Fetch step under control and works well in combination with NUTCH-769.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions
[ https://issues.apache.org/jira/browse/NUTCH-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-769: Attachment: NUTCH-769-2.patch

Fetcher to skip queues for URLS getting repeated exceptions - Key: NUTCH-769 URL: https://issues.apache.org/jira/browse/NUTCH-769 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Julien Nioche Priority: Minor Attachments: NUTCH-769-2.patch, NUTCH-769.patch

As discussed on the mailing list (see http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg15360.html) this patch allows clearing URL queues in the Fetcher when more than a set number of exceptions have been encountered in a row. This can speed up the fetching substantially in cases where target hosts are not responsive (as a TimeoutException would be thrown) and limits cases where a whole Fetch step is slowed down because of a few queues. By default the parameter fetcher.max.exceptions.per.queue has a value of -1, i.e. the feature is deactivated.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions
[ https://issues.apache.org/jira/browse/NUTCH-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783247#action_12783247 ] Julien Nioche commented on NUTCH-769: - Missed a couple of lines indeed when I was trying to untangle this functionality from my (heavily modified) local copy. checkExceptionThreshold is called after line 664:

case ProtocolStatus.EXCEPTION:
  logError(fit.url, status.getMessage());
  int killedURLs = fetchQueues.checkExceptionThreshold(fit.getQueueID());
  reporter.incrCounter("FetcherStatus", "Exceptions", killedURLs);

I'll attach a modified version of the patch. Thanks J.

Fetcher to skip queues for URLS getting repeated exceptions - Key: NUTCH-769 URL: https://issues.apache.org/jira/browse/NUTCH-769 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Julien Nioche Priority: Minor Attachments: NUTCH-769-2.patch, NUTCH-769.patch

As discussed on the mailing list (see http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg15360.html) this patch allows clearing URL queues in the Fetcher when more than a set number of exceptions have been encountered in a row. This can speed up the fetching substantially in cases where target hosts are not responsive (as a TimeoutException would be thrown) and limits cases where a whole Fetch step is slowed down because of a few queues. By default the parameter fetcher.max.exceptions.per.queue has a value of -1, i.e. the feature is deactivated.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
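The per-queue mechanism behind checkExceptionThreshold can be sketched as a consecutive-exception counter that purges the host's queue once a configured limit is hit. Class and method names here are illustrative, not the actual FetchItemQueues code:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of the NUTCH-769 mechanism: count consecutive exceptions per host
// queue and drop the remaining queued URLs once a threshold is reached, so
// a dead host cannot stall the whole fetch.
public class ExceptionThresholdSketch {

    public final Queue<String> urls = new ArrayDeque<>();
    public int consecutiveExceptions = 0;

    // Mirrors fetcher.max.exceptions.per.queue: a negative value disables the feature.
    public final int maxExceptions;

    public ExceptionThresholdSketch(int maxExceptions) { this.maxExceptions = maxExceptions; }

    public void onSuccess() { consecutiveExceptions = 0; } // a success resets the streak

    // Returns how many queued URLs were dropped, like checkExceptionThreshold().
    public int onException() {
        consecutiveExceptions++;
        if (maxExceptions >= 0 && consecutiveExceptions >= maxExceptions) {
            int killed = urls.size();
            urls.clear();              // give up on this unresponsive host
            return killed;
        }
        return 0;
    }
}
```

Counting only *consecutive* exceptions (reset on any success) matches the "encountered in a row" wording in the issue, so a flaky-but-alive host is not purged.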
[jira] Commented: (NUTCH-770) Timebomb for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783248#action_12783248 ] Julien Nioche commented on NUTCH-770: - The log simply shows that the patch has not been applied properly. See http://markmail.org/message/wbd3r3t5bfxzkbpn for a discussion on how to apply patches. Should work fine from the root directory of Nutch with: patch -p0 < ~/Desktop/NUTCH-770.patch

Timebomb for Fetcher Key: NUTCH-770 URL: https://issues.apache.org/jira/browse/NUTCH-770 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Attachments: log-770, NUTCH-770.patch

This patch provides the Fetcher with a timebomb mechanism. By default the timebomb is not activated; it can be set using the parameter fetcher.timebomb.mins. The number of minutes is relative to the start of the Fetch job. When the number of minutes is reached, the QueueFeeder skips all remaining entries, then all active queues are purged. This allows keeping the Fetch step under control and works well in combination with NUTCH-769.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-770) Timebomb for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783252#action_12783252 ] MilleBii commented on NUTCH-770: That's what I did, and I just retried... so I'm a bit surprised too. Other patches have worked fine so far. ???

Timebomb for Fetcher Key: NUTCH-770 URL: https://issues.apache.org/jira/browse/NUTCH-770 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Attachments: log-770, NUTCH-770.patch

This patch provides the Fetcher with a timebomb mechanism. By default the timebomb is not activated; it can be set using the parameter fetcher.timebomb.mins. The number of minutes is relative to the start of the Fetch job. When the number of minutes is reached, the QueueFeeder skips all remaining entries, then all active queues are purged. This allows keeping the Fetch step under control and works well in combination with NUTCH-769.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-770) Timebomb for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783283#action_12783283 ] Andrzej Bialecki commented on NUTCH-770: - I propose to change the name of this functionality - timebomb is not self-explanatory, and it suggests that if you misbehave then your cluster may explode ;) Instead I would use time limit, rename all vars and methods to follow this naming, and document it properly in nutch-default.xml. A few comments on the patch:
* it has some overlap with NUTCH-769 (the emptyQueue() method), but that's easy to resolve, see also the next point.
* why change the code in FetchQueues at all? The time limit is a global condition; we could just break the main loop in run() and ignore the QueueFeeder (or not start it at all if the time limit has already passed when run() starts).
* the patch does not follow the code style (notably whitespace in for/while loops and assignments).

Timebomb for Fetcher Key: NUTCH-770 URL: https://issues.apache.org/jira/browse/NUTCH-770 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Attachments: log-770, NUTCH-770.patch

This patch provides the Fetcher with a timebomb mechanism. By default the timebomb is not activated; it can be set using the parameter fetcher.timebomb.mins. The number of minutes is relative to the start of the Fetch job. When the number of minutes is reached, the QueueFeeder skips all remaining entries, then all active queues are purged. This allows keeping the Fetch step under control and works well in combination with NUTCH-769.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
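The time-limit variant proposed in the comment above — a single global deadline checked by the main loop, rather than per-queue purging — can be sketched as follows. The parameter name fetcher.timelimit.mins and all class names are assumptions for illustration:

```java
// Sketch of a global fetch time limit (the renamed "timebomb"): compute an
// absolute deadline once when the job starts; the main run() loop and the
// QueueFeeder simply stop when the deadline has passed. A negative number
// of minutes means no limit, matching the "-1 disables it" convention.
public class TimeLimitSketch {

    public final long deadline; // absolute time in ms, or -1 for "no limit"

    public TimeLimitSketch(long limitMins) {
        this.deadline = limitMins < 0
            ? -1
            : System.currentTimeMillis() + limitMins * 60_000L;
    }

    // The fetcher's main loop would call this each iteration and break out
    // (instead of mutating the queues) once it returns true.
    public boolean expired() {
        return deadline > 0 && System.currentTimeMillis() >= deadline;
    }
}
```

Keeping the limit as one global condition, as the comment suggests, avoids touching FetchQueues at all: expiry just breaks the loop and lets the normal shutdown path drain the remaining work.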
[jira] Closed: (NUTCH-746) NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container.
[ https://issues.apache.org/jira/browse/NUTCH-746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-746. --- Resolution: Fixed Assignee: Andrzej Bialecki

NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container. Key: NUTCH-746 URL: https://issues.apache.org/jira/browse/NUTCH-746 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Environment: Apache Tomcat 5.5.27 and 6.0.18, Fedora 11, OpenJDK or Sun JDK 1.6 OpenJDK 64-Bit Server VM (build 14.0-b15, mixed mode) Reporter: Kirby Bohling Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-746.patch

NutchBeanConstructor is not cleaning up upon application shutdown (contextDestroyed()). It leaves open the SegmentUpdater, and potentially other resources. This means the WebApp's classloader cannot be GC'ed in Tomcat, which after repeated restarts will lead to a PermGen error.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-746) NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container.
[ https://issues.apache.org/jira/browse/NUTCH-746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783287#action_12783287 ] Andrzej Bialecki commented on NUTCH-746: - Fixed in rev. 885148. Thanks!

NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container. Key: NUTCH-746 URL: https://issues.apache.org/jira/browse/NUTCH-746 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Environment: Apache Tomcat 5.5.27 and 6.0.18, Fedora 11, OpenJDK or Sun JDK 1.6 OpenJDK 64-Bit Server VM (build 14.0-b15, mixed mode) Reporter: Kirby Bohling Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-746.patch

NutchBeanConstructor is not cleaning up upon application shutdown (contextDestroyed()). It leaves open the SegmentUpdater, and potentially other resources. This means the WebApp's classloader cannot be GC'ed in Tomcat, which after repeated restarts will lead to a PermGen error.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-738) Close SegmentUpdater when FetchedSegments is closed
[ https://issues.apache.org/jira/browse/NUTCH-738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-738. --- Resolution: Fixed Assignee: Andrzej Bialecki Close SegmentUpdater when FetchedSegments is closed --- Key: NUTCH-738 URL: https://issues.apache.org/jira/browse/NUTCH-738 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 1.0.0 Reporter: Martina Koch Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: FetchedSegments.patch, NUTCH-738.patch Currently FetchedSegments starts a SegmentUpdater, but never closes it when FetchedSegments is closed. (The problem was described in this mailing: http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg13823.html) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-739. --- Resolution: Fixed Assignee: Andrzej Bialecki

SolrDeleteDuplications too slow when using hadoop - Key: NUTCH-739 URL: https://issues.apache.org/jira/browse/NUTCH-739 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.0.0 Environment: hadoop cluster with 3 nodes Map Task Capacity: 6 Reduce Task Capacity: 6 Indexer: one instance of solr server (on one of the slave nodes) Reporter: Dmitry Lihachev Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-739_remove_optimize_on_solr_dedup.patch

In my environment I always have many warnings like this on the dedup step:

{noformat} Task attempt_200905270022_0212_r_03_0 failed to report status for 600 seconds. Killing! {noformat}

solr logs:

{noformat} INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173741 May 27, 2009 10:29:27 AM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: {optimize=} 0 173599 May 27, 2009 10:29:27 AM org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173599 May 27, 2009 10:29:27 AM org.apache.solr.search.SolrIndexSearcher close INFO: Closing searc...@2ad9ac58 main May 27, 2009 10:29:27 AM org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean getMBeanInfo WARNING: Could not getStatistics on info bean org.apache.solr.search.SolrIndexSearcher org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed {noformat}

So I think the problem is in the piece of code on line 301 of SolrDeleteDuplications (solr.optimize()), because we have a few job tasks, each of which tries to optimize the solr indexes before closing.
The simplest way to avoid this bug is removing this line and sending an <optimize/> message directly to the solr server after the dedup step.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
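The fix described above — dropping the per-task solr.optimize() call and issuing a single optimize after the whole dedup job — can be sketched like this. The client interface here is a hypothetical stand-in, not the SolrJ API:

```java
// Sketch of the NUTCH-739 fix: N reduce tasks each ending with
// solr.optimize() means N expensive full-index optimizes (and task
// timeouts while Solr grinds); instead, each task only commits, and the
// driver sends exactly one optimize once all tasks have finished.
public class DedupOptimizeSketch {

    // Hypothetical minimal client surface for the sketch.
    public interface SolrClientish {
        void commit();
        void optimize();
    }

    public static void runDedup(SolrClientish solr, int reduceTasks) {
        for (int task = 0; task < reduceTasks; task++) {
            solr.commit();         // per-task: cheap commit only, no optimize
        }
        solr.optimize();           // once, after the whole dedup step
    }
}
```

This matches the log evidence in the issue: each optimize took roughly 173 seconds, so six reduce tasks racing to optimize the same index easily exceeds the 600-second task-status timeout.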
[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783290#action_12783290 ] Andrzej Bialecki commented on NUTCH-739: - Fixed in rev. 885152. Thank you!

SolrDeleteDuplications too slow when using hadoop - Key: NUTCH-739 URL: https://issues.apache.org/jira/browse/NUTCH-739 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.0.0 Environment: hadoop cluster with 3 nodes Map Task Capacity: 6 Reduce Task Capacity: 6 Indexer: one instance of solr server (on one of the slave nodes) Reporter: Dmitry Lihachev Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-739_remove_optimize_on_solr_dedup.patch

In my environment I always have many warnings like this on the dedup step:

{noformat} Task attempt_200905270022_0212_r_03_0 failed to report status for 600 seconds. Killing! {noformat}

solr logs:

{noformat} INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173741 May 27, 2009 10:29:27 AM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: {optimize=} 0 173599 May 27, 2009 10:29:27 AM org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173599 May 27, 2009 10:29:27 AM org.apache.solr.search.SolrIndexSearcher close INFO: Closing searc...@2ad9ac58 main May 27, 2009 10:29:27 AM org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean getMBeanInfo WARNING: Could not getStatistics on info bean org.apache.solr.search.SolrIndexSearcher org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed {noformat}

So I think the problem is in the piece of code on line 301 of SolrDeleteDuplications (solr.optimize()), because we have a few job tasks, each of which tries to optimize the solr indexes before closing.
The simplest way to avoid this bug is removing this line and sending an <optimize/> message directly to the solr server after the dedup step.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-755) DomainURLFilter crashes on malformed URL
[ https://issues.apache.org/jira/browse/NUTCH-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-755. --- Resolution: Cannot Reproduce Assignee: Andrzej Bialecki DomainURLFilter crashes on malformed URL Key: NUTCH-755 URL: https://issues.apache.org/jira/browse/NUTCH-755 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: Tomcat 6.0.14 Java 1.6.0_14 Linux Reporter: Mike Baranczak Assignee: Andrzej Bialecki 2009-09-16 21:54:17,001 ERROR [Thread-156] DomainURLFilter - Could not apply filter on url: http:/comments.php java.lang.NullPointerException at org.apache.nutch.urlfilter.domain.DomainURLFilter.filter(DomainURLFilter.java:173) at org.apache.nutch.net.URLFilters.filter(URLFilters.java:88) at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:200) at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:113) at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:96) at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:70) at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410) at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170) Expected behavior would be to recognize the URL as malformed, and reject it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-755) DomainURLFilter crashes on malformed URL
[ https://issues.apache.org/jira/browse/NUTCH-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783299#action_12783299 ] Andrzej Bialecki commented on NUTCH-755: - I could not verify that the filter indeed crashes - it simply prints the exception and then returns null, as you suggested.

DomainURLFilter crashes on malformed URL Key: NUTCH-755 URL: https://issues.apache.org/jira/browse/NUTCH-755 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: Tomcat 6.0.14 Java 1.6.0_14 Linux Reporter: Mike Baranczak Assignee: Andrzej Bialecki

2009-09-16 21:54:17,001 ERROR [Thread-156] DomainURLFilter - Could not apply filter on url: http:/comments.php java.lang.NullPointerException at org.apache.nutch.urlfilter.domain.DomainURLFilter.filter(DomainURLFilter.java:173) at org.apache.nutch.net.URLFilters.filter(URLFilters.java:88) at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:200) at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:113) at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:96) at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:70) at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410) at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)

Expected behavior would be to recognize the URL as malformed, and reject it.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783302#action_12783302 ] Andrzej Bialecki commented on NUTCH-692: - We should review this issue after the upgrade to Hadoop 0.20 - task output mgmt differs there, and the problem may be nonexistent.

AlreadyBeingCreatedException with Hadoop 0.19 - Key: NUTCH-692 URL: https://issues.apache.org/jira/browse/NUTCH-692 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Julien Nioche Attachments: NUTCH-692.patch

I have been using the SVN version of Nutch on an EC2 cluster and got some AlreadyBeingCreatedException during the reduce phase of a parse. For some reason one of my tasks crashed and then I ran into this AlreadyBeingCreatedException when other nodes tried to pick it up. There was recently a discussion on the Hadoop user list on similar issues with Hadoop 0.19 (see http://markmail.org/search/after+upgrade+to+0%2E19%2E0). I have not tried using 0.18.2 yet but will do if the problems persist with 0.19. I was wondering whether anyone else had experienced the same problem. Do you think 0.19 is stable enough to use it for Nutch 1.0? I will be running a crawl on a super large cluster in the next couple of weeks and I will confirm this issue J.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-741) Job file includes multiple copies of nutch config files.
[ https://issues.apache.org/jira/browse/NUTCH-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783304#action_12783304 ] Andrzej Bialecki commented on NUTCH-741: - Fixed in rev. 885156. Thank you!

Job file includes multiple copies of nutch config files. Key: NUTCH-741 URL: https://issues.apache.org/jira/browse/NUTCH-741 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Kirby Bohling Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: removeJobDupConf.diff

From a clean checkout, running ant tar will create a .job file. The .job file includes two copies of the nutch-site.xml and nutch-default.xml file.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-741) Job file includes multiple copies of nutch config files.
[ https://issues.apache.org/jira/browse/NUTCH-741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-741. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki Job file includes multiple copies of nutch config files. Key: NUTCH-741 URL: https://issues.apache.org/jira/browse/NUTCH-741 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Kirby Bohling Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: removeJobDupConf.diff From a clean checkout, running ant tar will create a .job file. The .job file includes two copies of the nutch-site.xml and nutch-default.xml file. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-712) ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers
[ https://issues.apache.org/jira/browse/NUTCH-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-712. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers - Key: NUTCH-712 URL: https://issues.apache.org/jira/browse/NUTCH-712 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: ParseOutputFormat-NUTCH712v2.patch ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers otherwise the whole parsing step crashes instead of simply ignoring dodgy outlinks -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
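The behaviour the issue asks for — one malformed outlink is skipped rather than crashing the whole parse output step — amounts to wrapping the per-outlink normalization in a try/catch. A minimal self-contained sketch (normalize() here is a trivial stand-in for a real URL normalizer):

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Sketch of the NUTCH-712 behaviour: normalize each outlink inside a
// try/catch so a dodgy URL is dropped instead of aborting the parse step.
public class OutlinkFilterSketch {

    // Stand-in normalizer: throws MalformedURLException on bad input,
    // just like a normalizer backed by java.net.URL would.
    public static String normalize(String url) throws MalformedURLException {
        return new URL(url).toString();
    }

    public static List<String> keepValid(List<String> outlinks) {
        List<String> ok = new ArrayList<>();
        for (String link : outlinks) {
            try {
                ok.add(normalize(link));
            } catch (MalformedURLException e) {
                // dodgy outlink: ignore and carry on with the rest
            }
        }
        return ok;
    }
}
```

The key design point is the scope of the catch: it surrounds a single outlink, so one bad link costs only that link, never the document's other outlinks or the job.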
[jira] Commented: (NUTCH-712) ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers
[ https://issues.apache.org/jira/browse/NUTCH-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783306#action_12783306 ] Andrzej Bialecki commented on NUTCH-712: - Fixed in rev. 885159. Thank you!

ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers - Key: NUTCH-712 URL: https://issues.apache.org/jira/browse/NUTCH-712 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: ParseOutputFormat-NUTCH712v2.patch

ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers, otherwise the whole parsing step crashes instead of simply ignoring dodgy outlinks.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[Nutch Wiki] Trivial Update of Automating_Fetches_with_Python by newacct
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification. The Automating_Fetches_with_Python page has been changed by newacct.
http://wiki.apache.org/nutch/Automating_Fetches_with_Python?action=diff&rev1=5&rev2=6

--

  import sys
  import getopt
  import re
- import string
  import logging
  import logging.config
  import commands
@@ -259, +258 @@
          total_urls += 1
      urllinecount.close()
      numsplits = total_urls / splitsize
-     padding = "0" * len(`numsplits`)
+     padding = "0" * len(repr(numsplits))
      # create the url load folder
-     linenum = 0
      filenum = 0
-     strfilenum = `filenum`
+     strfilenum = repr(filenum)
      urloutdir = outdir + "/urls-" + padding[len(strfilenum):] + strfilenum
      os.mkdir(urloutdir)
      urlfile = urloutdir + "/urls"
@@ -275, +273 @@
      outhandle = open(urlfile, "w")
      # loop through the file
-     for line in inhandle:
+     for linenum, line in enumerate(inhandle):
          # if we have come to a split then close the current file, create a new
          # url folder and open a new url file
-         if linenum > 0 and (linenum % splitsize == 0):
+         if linenum > 0 and linenum % splitsize == 0:
-             filenum = filenum + 1
+             filenum += 1
-             strfilenum = `filenum`
+             strfilenum = repr(filenum)
              urloutdir = outdir + "/urls-" + padding[len(strfilenum):] + strfilenum
              os.mkdir(urloutdir)
              urlfile = urloutdir + "/urls"
@@ -290, +288 @@
              outhandle.close()
              outhandle = open(urlfile, "w")
-         # write the url to the file and increase the number of lines read
+         # write the url to the file
          outhandle.write(line)
-         linenum = linenum + 1
      # close the input and output files
      inhandle.close()
@@ -362, +359 @@
          # fetch the current segment
          outar = result[1].splitlines()
-         output = outar[len(outar) - 1]
+         output = outar[-1]
-         tempseg = string.split(output)[0]
+         tempseg = output.split()[0]
          tempseglist.append(tempseg)
          fetch = self.nutchdir + "/bin/nutch fetch " + tempseg
          self.log.info("Starting fetch for: " + tempseg)
@@ -392, +389 @@
          # merge the crawldbs
          self.log.info("Merging master and temp crawldbs.")
-         crawlmerge = (self.nutchdir + "/bin/nutch mergedb mergetemp/crawldb " +
+         crawlmerge = self.nutchdir + "/bin/nutch mergedb mergetemp/crawldb " + \
-                      mastercrawldbdir + " " + string.join(tempdblist, " "))
+                      mastercrawldbdir + " " + " ".join(tempdblist)
          self.log.info("Running: " + crawlmerge)
          result = commands.getstatusoutput(crawlmerge)
          self.checkStatus(result, "Error occurred while running command " + crawlmerge)
@@ -404, +401 @@
          result = commands.getstatusoutput(getsegment)
          self.checkStatus(result, "Error occurred while running command " + getsegment)
          outar = result[1].splitlines()
-         output = outar[len(outar) - 1]
+         output = outar[-1]
-         masterseg = string.split(output)[0]
+         masterseg = output.split()[0]
-         mergesegs = (self.nutchdir + "/bin/nutch mergesegs mergetemp/segments " +
+         mergesegs = self.nutchdir + "/bin/nutch mergesegs mergetemp/segments " + \
-                     masterseg + " " + string.join(tempseglist, " "))
+                     masterseg + " " + " ".join(tempseglist)
          self.log.info("Running: " + mergesegs)
          result = commands.getstatusoutput(mergesegs)
          self.checkStatus(result, "Error occurred while running command " + mergesegs)
@@ -464, +461 @@
      usage.append("[-b | --backupdir] The master backup directory, [crawl-backup].\n")
      usage.append("[-s | --splitsize] The number of urls per load [50].\n")
      usage.append("[-f | --fetchmerge] The number of fetches to run before merging [1].\n")
-     message = string.join(usage)
+     message = " ".join(usage)
      print message
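The hunks above replace deprecated Python idioms (backticks, the `string` module, a hand-maintained line counter) with their modern equivalents. A small self-contained sketch of those replacements, written in current Python rather than the Python 2 dialect of the wiki script, with a hypothetical `split_urls` helper standing in for the script's file-split loop:

```python
# Demonstrates the replacements made in the diff: backticks -> repr(),
# a manual line counter -> enumerate(), string.join/split -> str methods.
# split_urls is an illustrative stand-in, not a function from the wiki script.

def split_urls(lines, splitsize):
    """Group url lines into chunks of splitsize, as the script's split loop does."""
    chunks = [[]]
    for linenum, line in enumerate(lines):        # replaces the hand-kept counter
        if linenum > 0 and linenum % splitsize == 0:
            chunks.append([])                     # start a new "urls-NN" chunk
        chunks[-1].append(line)
    return chunks

numsplits = 12
padding = "0" * len(repr(numsplits))              # repr() replaces `numsplits` backticks
strfilenum = repr(0)
print(padding[len(strfilenum):] + strfilenum)     # zero-padded folder suffix: "00"

print(" ".join(["a", "b"]))                       # str.join replaces string.join(list, " ")
print("crawl/segments/123  done".split()[0])      # str.split replaces string.split(output)
```

The `enumerate` version also lets the diff drop the `linenum = linenum + 1` bookkeeping line entirely.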
[jira] Commented: (NUTCH-738) Close SegmentUpdater when FetchedSegments is closed
[ https://issues.apache.org/jira/browse/NUTCH-738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783359#action_12783359 ]

Hudson commented on NUTCH-738:
------------------------------

Integrated in Nutch-trunk #996 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/996/])
Close SegmentUpdater when FetchedSegments is closed.

Close SegmentUpdater when FetchedSegments is closed
---------------------------------------------------

                Key: NUTCH-738
                URL: https://issues.apache.org/jira/browse/NUTCH-738
            Project: Nutch
         Issue Type: Improvement
         Components: searcher
   Affects Versions: 1.0.0
           Reporter: Martina Koch
           Assignee: Andrzej Bialecki
           Priority: Minor
            Fix For: 1.1
        Attachments: FetchedSegments.patch, NUTCH-738.patch

Currently FetchedSegments starts a SegmentUpdater, but never closes it when FetchedSegments is closed. (The problem was described in this mailing: http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg13823.html)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-741) Job file includes multiple copies of nutch config files.
[ https://issues.apache.org/jira/browse/NUTCH-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783357#action_12783357 ]

Hudson commented on NUTCH-741:
------------------------------

Integrated in Nutch-trunk #996 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/996/])
Job file includes multiple copies of nutch config files.

Job file includes multiple copies of nutch config files.
--------------------------------------------------------

                Key: NUTCH-741
                URL: https://issues.apache.org/jira/browse/NUTCH-741
            Project: Nutch
         Issue Type: Bug
         Components: build
   Affects Versions: 1.0.0
           Reporter: Kirby Bohling
           Assignee: Andrzej Bialecki
           Priority: Minor
            Fix For: 1.1
        Attachments: removeJobDupConf.diff

From a clean checkout, running "ant tar" will create a .job file. The .job file includes two copies of the nutch-site.xml and nutch-default.xml files.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-712) ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers
[ https://issues.apache.org/jira/browse/NUTCH-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783360#action_12783360 ]

Hudson commented on NUTCH-712:
------------------------------

Integrated in Nutch-trunk #996 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/996/])
ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers.

ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers
-------------------------------------------------------------------------------------

                Key: NUTCH-712
                URL: https://issues.apache.org/jira/browse/NUTCH-712
            Project: Nutch
         Issue Type: Improvement
   Affects Versions: 1.0.0
           Reporter: Julien Nioche
           Assignee: Andrzej Bialecki
            Fix For: 1.1
        Attachments: ParseOutputFormat-NUTCH712v2.patch

ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers; otherwise the whole parsing step crashes instead of simply ignoring dodgy outlinks.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
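The underlying pattern is catching a per-item exception inside the loop so one bad outlink is skipped rather than aborting the whole parse. A sketch in Python (not the Java code from the patch); `normalize` here is a hypothetical stand-in for Nutch's URL-normalizer chain:

```python
# Skip malformed outlinks instead of letting one of them crash the parse step.
# normalize() is an illustrative stand-in for the real normalizer chain.
from urllib.parse import urlparse

def normalize(url):
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError("malformed URL: " + url)   # plays the role of MalformedURLException
    return parsed.geturl().lower()

def normalize_outlinks(outlinks):
    kept = []
    for url in outlinks:
        try:
            kept.append(normalize(url))
        except ValueError:   # catch per outlink, inside the loop
            continue         # ignore the dodgy outlink, keep processing the rest
    return kept

print(normalize_outlinks(["http://Example.com/A", "not a url"]))
```

The key point is the placement of the try/except: around each outlink, not around the whole loop, so good outlinks after a bad one are still emitted.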
[jira] Commented: (NUTCH-746) NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container.
[ https://issues.apache.org/jira/browse/NUTCH-746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783356#action_12783356 ]

Hudson commented on NUTCH-746:
------------------------------

Integrated in Nutch-trunk #996 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/996/])
NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container.

NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container.
------------------------------------------------------------------------------------------------------------

                Key: NUTCH-746
                URL: https://issues.apache.org/jira/browse/NUTCH-746
            Project: Nutch
         Issue Type: Bug
   Affects Versions: 1.0.0
        Environment: Apache Tomcat 5.5.27 and 6.0.18, Fedora 11, OpenJDK or Sun JDK 1.6, OpenJDK 64-Bit Server VM (build 14.0-b15, mixed mode)
           Reporter: Kirby Bohling
           Assignee: Andrzej Bialecki
            Fix For: 1.1
        Attachments: NUTCH-746.patch

NutchBeanConstructor is not cleaning up upon application shutdown (contextDestroyed()). It leaves open the SegmentUpdater, and potentially other resources. This prevents the WebApp's classloader from being GC'ed in Tomcat, which after repeated restarts will lead to a PermGen error.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783358#action_12783358 ]

Hudson commented on NUTCH-739:
------------------------------

Integrated in Nutch-trunk #996 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/996/])
SolrDeleteDuplications too slow when using hadoop.

SolrDeleteDuplications too slow when using hadoop
-------------------------------------------------

                Key: NUTCH-739
                URL: https://issues.apache.org/jira/browse/NUTCH-739
            Project: Nutch
         Issue Type: Bug
         Components: indexer
   Affects Versions: 1.0.0
        Environment: hadoop cluster with 3 nodes; Map Task Capacity: 6; Reduce Task Capacity: 6; Indexer: one instance of solr server (on one of the slave nodes)
           Reporter: Dmitry Lihachev
           Assignee: Andrzej Bialecki
            Fix For: 1.1
        Attachments: NUTCH-739_remove_optimize_on_solr_dedup.patch

In my environment I always have many warnings like this on the dedup step:

{noformat}
Task attempt_200905270022_0212_r_03_0 failed to report status for 600 seconds. Killing!
{noformat}

solr logs:

{noformat}
INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173741
May 27, 2009 10:29:27 AM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {optimize=} 0 173599
May 27, 2009 10:29:27 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173599
May 27, 2009 10:29:27 AM org.apache.solr.search.SolrIndexSearcher close
INFO: Closing searc...@2ad9ac58 main
May 27, 2009 10:29:27 AM org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean getMBeanInfo
WARNING: Could not getStatistics on info bean org.apache.solr.search.SolrIndexSearcher
org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
{noformat}

So I think the problem is in the piece of code on line 301 of SolrDeleteDuplications (solr.optimize()). Because we have several reduce tasks, each of them tries to optimize the Solr index before closing. The simplest way to avoid this bug is to remove this line and send an <optimize/> message directly to the Solr server after the dedup step.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
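The suggested fix amounts to one optimize call for the whole job instead of one per reduce task. A sketch in Python of building that single post-dedup request; `build_optimize_request` and the host name are hypothetical, while the /update path and the <optimize/> XML message are standard Solr update-handler conventions:

```python
# One <optimize/> message sent once after the dedup job, replacing the
# per-task solr.optimize() calls. build_optimize_request is a hypothetical
# helper, not Nutch code; the URL here is a placeholder.

def build_optimize_request(solr_url):
    """Return (url, body, headers) for a single post-dedup optimize call."""
    return (solr_url.rstrip("/") + "/update",   # Solr's XML update handler
            "<optimize/>",                      # standard Solr optimize message
            {"Content-Type": "text/xml"})

url, body, headers = build_optimize_request("http://solr-host:8983/solr/")
print(url)   # http://solr-host:8983/solr/update
print(body)  # <optimize/>
# The actual HTTP POST could then be issued once, e.g. with
# urllib.request.urlopen(urllib.request.Request(url, body.encode(), headers))
```

Since optimize rewrites the entire index, running it once at the end also avoids the long QTime values seen in the logs piling up across tasks.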