[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783290#action_12783290 ]

Andrzej Bialecki commented on NUTCH-739:
----------------------------------------

Fixed in rev. 885152. Thank you!

> SolrDeleteDuplications too slow when using hadoop
> -------------------------------------------------
>
> Key: NUTCH-739
> URL: https://issues.apache.org/jira/browse/NUTCH-739
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.0.0
> Environment: hadoop cluster with 3 nodes; Map Task Capacity: 6; Reduce Task Capacity: 6; Indexer: one instance of a Solr server (on one of the slave nodes)
> Reporter: Dmitry Lihachev
> Assignee: Andrzej Bialecki
> Fix For: 1.1
> Attachments: NUTCH-739_remove_optimize_on_solr_dedup.patch
>
> In my environment I always get many warnings like this during the dedup step:
> {noformat}
> Task attempt_200905270022_0212_r_03_0 failed to report status for 600 seconds. Killing!
> {noformat}
> Solr logs:
> {noformat}
> INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173741
> May 27, 2009 10:29:27 AM org.apache.solr.update.processor.LogUpdateProcessor finish
> INFO: {optimize=} 0 173599
> May 27, 2009 10:29:27 AM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173599
> May 27, 2009 10:29:27 AM org.apache.solr.search.SolrIndexSearcher close
> INFO: Closing searc...@2ad9ac58 main
> May 27, 2009 10:29:27 AM org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean getMBeanInfo
> WARNING: Could not getStatistics on info bean org.apache.solr.search.SolrIndexSearcher
> org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
> {noformat}
> So I think the problem is in the piece of code on line 301 of SolrDeleteDuplications (solr.optimize()): because we have several reduce tasks, each of them tries to optimize the Solr index before closing. The simplest way to avoid this bug is to remove this line and send an <optimize/> message directly to the Solr server after the dedup step.
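For context, the reducer-side pattern the report describes looks roughly like the sketch below. This is illustrative only, not the literal Nutch source; it assumes a SolrJ server handle held by each reduce task.

{code:java}
import java.io.IOException;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;

// Rough sketch of the pattern described above (not the literal Nutch
// source): every reduce task's close() issues its own optimize, so with
// a reduce capacity of 6 one dedup run can fire several concurrent
// full-index optimizes at the single Solr server.
public class DedupReducerCloseSketch {
  private SolrServer solr;  // SolrJ handle held by each reduce task

  public void close() throws IOException {
    try {
      solr.optimize();  // the "line 301" call: one per reduce task
    } catch (SolrServerException e) {
      throw new IOException(e.toString());
    }
  }
}
{code}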
[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783358#action_12783358 ]

Hudson commented on NUTCH-739:
------------------------------

Integrated in Nutch-trunk #996 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/996/]):
SolrDeleteDuplications too slow when using hadoop.
[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714327#action_12714327 ]

Doğacan Güney commented on NUTCH-739:
-------------------------------------

I agree with Dmitry. We should not need more than one optimize call; it was my mistake not to consider the case of multiple tasks all trying to optimize at the same time. I am ready to be proven wrong (or right, depending on your POV :)

However, I still believe that we should not require users to use curl directly. Can't we just move the optimize call to somewhere after the job is finished?
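A minimal sketch of that placement, assuming SolrJ's CommonsHttpSolrServer and the old org.apache.hadoop.mapred API; the class and method names here are illustrative, not the actual Nutch code:

{code:java}
import java.io.IOException;

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

// Illustrative driver-side placement of the optimize call: the reducers
// only delete duplicates, and the driver optimizes exactly once after
// the job completes, so concurrent optimizes can no longer pile up.
public class DedupDriverSketch {
  public void dedup(JobConf job, String solrUrl)
      throws IOException, SolrServerException {
    JobClient.runJob(job);  // blocks until all reduce tasks finish
    new CommonsHttpSolrServer(solrUrl).optimize();  // single optimize
  }
}
{code}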
[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714346#action_12714346 ]

Dmitry Lihachev commented on NUTCH-739:
---------------------------------------

Doğacan, I agree with you about the curl usage. Maybe we should write a SolrOptimizer Tool in org.apache.nutch.indexer.solr and call this tool from bin/nutch?
[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714349#action_12714349 ]

Dmitry Lihachev commented on NUTCH-739:
---------------------------------------

Oops, sorry... a Tool is a Map/Reduce application. OK, we can write a standard Java application with a main method and run it from bin/nutch.
[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714536#action_12714536 ]

Otis Gospodnetic commented on NUTCH-739:
----------------------------------------

Yeah, sounds right. That tool should make use of SolrJ then; we already have it as a dependency in lib/.
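A minimal sketch of what such a tool could look like, assuming the SolrJ client in lib/ (CommonsHttpSolrServer); the class is hypothetical, following Dmitry's suggested name and package, not the code that was eventually committed:

{code:java}
package org.apache.nutch.indexer.solr;

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

// Hypothetical standalone optimizer: a plain Java main(), not a
// Map/Reduce job, so bin/nutch can invoke it once after solrdedup.
public class SolrOptimizer {
  public static void main(String[] args) throws Exception {
    if (args.length != 1) {
      System.err.println("Usage: SolrOptimizer <solr url>");
      System.exit(-1);
    }
    // Equivalent to POSTing <optimize/> to <solr url>/update.
    new CommonsHttpSolrServer(args[0]).optimize();
  }
}
{code}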
[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714264#action_12714264 ]

Dmitry Lihachev commented on NUTCH-739:
---------------------------------------

In my recrawl script I have the following lines:

{code}
server=http://some.server.org
bin/nutch solrdedup $server
curl -so /dev/null -H 'Content-Type: text/xml' -d '<optimize/>' "$server/update"
{code}

You can always send commands to Solr without Java.
[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714277#action_12714277 ]

Ken Krugler commented on NUTCH-739:
-----------------------------------

There's another approach that works well here, and that's to start up a thread that calls the Hadoop reporter while the optimize is happening. We ran into the same issue when optimizing large Lucene indexes from our Bixo IndexScheme tap for Cascading. You can find that code on GitHub, but the skeleton is to do something like this in the reducer's close() method, assuming you've stashed the reporter from the reduce() call:

{code:java}
// Hadoop needs to know we're still working on it.
Thread reporterThread = new Thread() {
  public void run() {
    // Ping the stashed reporter every 10 seconds until interrupted,
    // so the TaskTracker doesn't kill the attempt for inactivity.
    while (!isInterrupted()) {
      reporter.progress();
      try {
        sleep(10 * 1000);
      } catch (InterruptedException e) {
        interrupt();  // re-set the flag so the loop exits
      }
    }
  }
};

reporterThread.start();

indexWriter.optimize();
// ...and other lengthy tasks here...

reporterThread.interrupt();
{code}
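For what it's worth, the 10-second progress ping keeps the attempt well inside Hadoop's 600-second inactivity limit (mapred.task.timeout, the setting behind the "failed to report status for 600 seconds" kill above), however long the optimize takes.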
[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714286#action_12714286 ]

Otis Gospodnetic commented on NUTCH-739:
----------------------------------------

Yes, external optimize calls will work; I was just wondering if we could avoid that. I like Ken's suggestion that tells Hadoop the task is still alive. Do you think you could do that in your patch, Dmitry?
[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714287#action_12714287 ]

Dmitry Lihachev commented on NUTCH-739:
---------------------------------------

With this approach we still have multiple optimize calls (as many as we have reducers), but we need exactly one optimize call after dedup.
[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714288#action_12714288 ]

Dmitry Lihachev commented on NUTCH-739:
---------------------------------------

Am I wrong?
[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714290#action_12714290 ]

Dmitry Lihachev commented on NUTCH-739:
---------------------------------------

I think that optimizing Solr is not a Hadoop job; it does not need parallelization.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.