[jira] [Commented] (NUTCH-2179) Cleanup job for SOLR Performance Boost

2018-02-13 Thread David Johnson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16362912#comment-16362912
 ] 

David Johnson commented on NUTCH-2179:
--

Appears to have been resolved by NUTCH-2197, which introduced a similar delete 
queue.

> Cleanup job for SOLR Performance Boost
> --
>
> Key: NUTCH-2179
> URL: https://issues.apache.org/jira/browse/NUTCH-2179
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.9, 1.10, 1.11
>Reporter: David Johnson
>Priority: Minor
>  Labels: patch
> Attachments: 0001-Create-delete-queue.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> During a cleanup job, index deletes are scheduled one by one, which can make 
> a large job take days



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2179) Cleanup job for SOLR Performance Boost

2015-12-01 Thread David Johnson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034541#comment-15034541
 ] 

David Johnson commented on NUTCH-2179:
--

Correct, it should be, but it is currently writing them one at a time (singleton deletes).

I'm proposing we swap the current delete function

public void delete(String key) throws IOException {
  if (delete) {
    try {
      solr.deleteById(key);  // singleton delete
      numDeletes++;
    } catch (final SolrServerException e) {
      throw makeIOException(e);
    }
  }
}

with

public void delete(String key) throws IOException {
  if (delete) {
    deleteURLs.add(key);
    if (inputDocs.size() + deleteURLs.size() >= batchSize) {
      sendRequest();
    }
  }
}


and abstract the actual send request into a function that can be called from 
write, delete, and close.
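
A minimal sketch of what that shared flush helper could look like, assuming the 
existing solr client, inputDocs buffer, numDeletes counter, and makeIOException 
helper from SolrIndexWriter, plus deleteURLs as a List of String (names are 
illustrative, not the committed patch):

private void sendRequest() throws IOException {
  try {
    // Flush any queued deletions as one bulk request instead of N singletons.
    if (!deleteURLs.isEmpty()) {
      solr.deleteById(deleteURLs);
      numDeletes += deleteURLs.size();
      deleteURLs.clear();
    }
    // Flush buffered adds/updates on the existing bulk path.
    if (!inputDocs.isEmpty()) {
      solr.add(inputDocs);
      inputDocs.clear();
    }
  } catch (final SolrServerException e) {
    throw makeIOException(e);
  }
}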

I have a patch prepared, but I cannot get the branch to commit - I'll attempt 
again later.

> Cleanup job for SOLR Performance Boost
> --
>
> Key: NUTCH-2179
> URL: https://issues.apache.org/jira/browse/NUTCH-2179
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.9, 1.10, 1.11
>Reporter: David Johnson
>Priority: Minor
>  Labels: patch
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> During a cleanup job, index deletes are scheduled one by one, which can make 
> a large job take days



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2179) Cleanup job for SOLR Performance Boost

2015-12-01 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034511#comment-15034511
 ] 

Sebastian Nagel commented on NUTCH-2179:


+1: SolrIndexWriter should queue the deletions the same way it already does for 
additions/updates. A bulk commit via an UpdateRequest appears to be assumed 
already, because numDeletes is taken into account when checking whether 
batchSize is reached (SolrIndexWriter, line 125: 
{{if (inputDocs.size() + numDeletes >= batchSize)}}).

> Cleanup job for SOLR Performance Boost
> --
>
> Key: NUTCH-2179
> URL: https://issues.apache.org/jira/browse/NUTCH-2179
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.9, 1.10, 1.11
>Reporter: David Johnson
>Priority: Minor
>  Labels: patch
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> During a cleanup job, index deletes are scheduled one by one, which can make 
> a large job take days



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)