[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-11-28 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783290#action_12783290
 ] 

Andrzej Bialecki  commented on NUTCH-739:
-

Fixed in rev. 885152. Thank you!

 SolrDeleteDuplications too slow when using hadoop
 -

 Key: NUTCH-739
 URL: https://issues.apache.org/jira/browse/NUTCH-739
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.0.0
 Environment: hadoop cluster with 3 nodes
 Map Task Capacity: 6
 Reduce Task Capacity: 6
 Indexer: one instance of Solr server (on one of the slave nodes)
Reporter: Dmitry Lihachev
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: NUTCH-739_remove_optimize_on_solr_dedup.patch


 In my environment I always see many warnings like this during the dedup step:
 {noformat}
 Task attempt_200905270022_0212_r_03_0 failed to report status for 600 seconds. Killing!
 {noformat}
 solr logs:
 {noformat}
 INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173741
 May 27, 2009 10:29:27 AM org.apache.solr.update.processor.LogUpdateProcessor finish
 INFO: {optimize=} 0 173599
 May 27, 2009 10:29:27 AM org.apache.solr.core.SolrCore execute
 INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173599
 May 27, 2009 10:29:27 AM org.apache.solr.search.SolrIndexSearcher close
 INFO: Closing searc...@2ad9ac58 main
 May 27, 2009 10:29:27 AM org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean getMBeanInfo
 WARNING: Could not getStatistics on info bean org.apache.solr.search.SolrIndexSearcher
 org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
 
 {noformat}
 So I think the problem is in the piece of code on line 301 of SolrDeleteDuplications 
 (solr.optimize()), because we have several job tasks, each of which tries to optimize 
 the Solr index before closing. The simplest way to avoid this bug is to remove that 
 line and send an <optimize/> message directly to the Solr server after the dedup step.




[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-11-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783358#action_12783358
 ] 

Hudson commented on NUTCH-739:
--

Integrated in Nutch-trunk #996 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/996/])
 SolrDeleteDuplications too slow when using hadoop.





[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-05-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714327#action_12714327
 ] 

Doğacan Güney commented on NUTCH-739:
-

I agree with Dmitry. We should not need more than one optimize call; it was my 
mistake not to consider the case of multiple tasks all trying to optimize at the 
same time. I am ready to be proven wrong (or right, depending on your POV :)

However, I still believe that we should not require users to use curl directly. 
Can't we just move the optimize call to somewhere after the job is finished?
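
For illustration, a minimal sketch of that idea, assuming the dedup job is driven 
from a Tool that knows the Solr URL (the method, its name, and the configuration 
comments below are assumptions, not the actual patch; only the Hadoop and SolrJ 
calls are real API):

{code:java}
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

// Sketch only: run the dedup job first, then optimize exactly once from the
// driver side instead of once per reducer.
public void dedup(String solrUrl) throws Exception {
  JobConf job = new JobConf(getConf());
  // ... the usual input/output/mapper/reducer configuration goes here ...
  JobClient.runJob(job);                 // blocks until all reducers are done

  SolrServer solr = new CommonsHttpSolrServer(solrUrl);
  solr.optimize();                       // a single optimize for the whole job
}
{code}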




[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-05-29 Thread Dmitry Lihachev (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714346#action_12714346
 ] 

Dmitry Lihachev commented on NUTCH-739:
---

Doğacan, I agree with you about the curl usage. Maybe we should write a Tool, 
SolrOptimizer, in org.apache.nutch.indexer.solr and call it from bin/nutch?




[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-05-29 Thread Dmitry Lihachev (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714349#action_12714349
 ] 

Dmitry Lihachev commented on NUTCH-739:
---

Oops, sorry... a Tool is a Map/Reduce application.
OK, we can write a standard Java application with a main method and run it from 
bin/nutch.




[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-05-29 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714536#action_12714536
 ] 

Otis Gospodnetic commented on NUTCH-739:


Yeah, sounds right. That Tool should make use of SolrJ then; we already have it 
as a dependency in lib/.
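
A hypothetical standalone version along those lines could be as small as the 
sketch below (the class name and usage string are assumptions; only the SolrJ 
calls are real):

{code:java}
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

// Hypothetical helper, not part of the attached patch: sends a single
// optimize request to the Solr server given on the command line.
public class SolrOptimizer {
  public static void main(String[] args) throws Exception {
    if (args.length != 1) {
      System.err.println("Usage: SolrOptimizer <solr url>");
      System.exit(-1);
    }
    SolrServer solr = new CommonsHttpSolrServer(args[0]);
    solr.optimize();   // equivalent to posting <optimize/> to /update
  }
}
{code}

Something like that could then be invoked from bin/nutch, as Dmitry suggested.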





[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-05-28 Thread Dmitry Lihachev (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714264#action_12714264
 ] 

Dmitry Lihachev commented on NUTCH-739:
---

In my recrawl script I have the following lines:
{code}
server=http://some.server.org
bin/nutch solrdedup $server
curl -so /dev/null -H 'Content-Type: text/xml' -d '<optimize/>' $server/update
{code}

You can always send commands to Solr without using Java.




[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-05-28 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714277#action_12714277
 ] 

Ken Krugler commented on NUTCH-739:
---

There's another approach that works well here, and that's to start up a thread 
that calls the Hadoop reporter while the optimize is happening.

We ran into the same issue when optimizing large Lucene indexes from our Bixo 
IndexScheme tap for Cascading. You can find that code on GitHub, but the 
skeleton is to do something like this in the reducer's close() method - 
assuming you've stashed the reporter from the reduce() call:

{code:java}
// Hadoop needs to know we're still working on it.
Thread reporterThread = new Thread() {
  public void run() {
    while (!isInterrupted()) {
      reporter.progress();
      try {
        sleep(10 * 1000);
      } catch (InterruptedException e) {
        interrupt();
      }
    }
  }
};
reporterThread.start();

indexWriter.optimize();
// ...and other lengthy tasks here...
reporterThread.interrupt();
{code}
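
For completeness, a sketch of how the reporter might be stashed from reduce() so 
that close() can use it (the class name and key/value types are illustrative, 
using the old mapred API):

{code:java}
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Illustrative skeleton: keep the Reporter from reduce() so that close()
// can start the keep-alive thread shown above before the long optimize.
public class DedupReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private Reporter reporter;

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    this.reporter = reporter;
    // ... normal per-key dedup work goes here ...
  }
}
{code}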






[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-05-28 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714286#action_12714286
 ] 

Otis Gospodnetic commented on NUTCH-739:


Yes, external optimize calls will work; I was just wondering if we could avoid 
that. I like Ken's suggestion of telling Hadoop that the task is still alive. Do 
you think you could do that in your patch, Dmitry?




[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-05-28 Thread Dmitry Lihachev (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714287#action_12714287
 ] 

Dmitry Lihachev commented on NUTCH-739:
---

With this approach we still have several optimize calls (as many as we have 
reducers), but we need exactly one optimize call after dedup.




[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-05-28 Thread Dmitry Lihachev (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714288#action_12714288
 ] 

Dmitry Lihachev commented on NUTCH-739:
---

Am I wrong?




[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-05-28 Thread Dmitry Lihachev (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714290#action_12714290
 ] 

Dmitry Lihachev commented on NUTCH-739:
---

I think that optimizing Solr is not a Hadoop job; it does not need 
parallelization.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.