[jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog

2013-04-19 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636727#comment-13636727
 ] 

Shawn Heisey commented on SOLR-3954:


The experimentation mentioned in my last comment was a success.  There is still 
a performance impact, but it is smaller, and tlog sizes are under control.  I 
still think a fix for this issue would be a good idea for general performance 
reasons, especially with DIH full-import.


 Option to have updateHandler and DIH skip updateLog
 ---

 Key: SOLR-3954
 URL: https://issues.apache.org/jira/browse/SOLR-3954
 Project: Solr
  Issue Type: Improvement
  Components: update
Affects Versions: 4.0
Reporter: Shawn Heisey
 Fix For: 4.3


 The updateLog feature makes updates take longer, likely because of the I/O 
 time required to write the additional information to disk.  It may take as 
 much as three times as long for the indexing portion of the process.  I'm not 
 sure whether it affects the time to commit, but I would imagine that the 
 difference there is small or zero.  When doing incremental updates/deletes on 
 an existing index, the time lag is probably very small and unimportant.
 When doing a full reindex (which may happen via DIH), especially if this is 
 done in a build core that is then swapped with a live core, this performance 
 hit is unacceptable.  It seems to make the import take about three times as 
 long.
 An option to have an update skip the updateLog would be very useful for these 
 situations.  It should have a method in SolrJ and be exposed in DIH as well.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog

2012-12-28 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540761#comment-13540761
 ] 

Shawn Heisey commented on SOLR-3954:


I am currently experimenting with updateLog turned on full time and using 
autoCommit to keep the size of the tlog directory under control.  Unconfirmed 
testing suggests that the overall slowdown using this method is not as extreme 
as it is when my entire dataimport happens without commits.

It's still my opinion that a fix for this issue would be a good idea, but I do 
not think it should hold up the 4.1 release.





[jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog

2012-10-16 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477106#comment-13477106
 ] 

Shawn Heisey commented on SOLR-3954:


I was unsure what to put for the priority.  Minor seems slightly too low and 
Major seems too high.




[jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog

2012-10-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477196#comment-13477196
 ] 

Mark Miller commented on SOLR-3954:
---

What config are you using? The updatelog should not normally have this kind of 
performance penalty.

In any case, I don't think we would add an option to skip the update log - you 
can remove it if the performance is unacceptable.




[jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog

2012-10-16 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477283#comment-13477283
 ] 

Shawn Heisey commented on SOLR-3954:


Which specific configuration bits would you like to see?  My solrconfig.xml 
file is heavily split into separate files and uses xinclude.  I will go ahead 
and paste my best guesses now.

{code}
<directoryFactory name="DirectoryFactory"
                  class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>

<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">35</int>
    <int name="segmentsPerTier">35</int>
    <int name="maxMergeAtOnceExplicit">105</int>
  </mergePolicy>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxMergeCount">4</int>
    <int name="maxThreadCount">4</int>
  </mergeScheduler>
  <ramBufferSizeMB>128</ramBufferSizeMB>
  <maxFieldLength>32768</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>1</commitLockTimeout>
  <lockType>native</lockType>
</indexDefaults>

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>0</maxDocs>
    <maxTime>0</maxTime>
  </autoCommit>
<!--
  <updateLog />
-->
</updateHandler>
{code}

My schema has 47 fields defined.  A typical document does not populate every 
field, but at least half of them are usually present.  I use the ICU classes 
for lowercasing, and most of the text fieldTypes use WordDelimiterFilter.

{code}
  <fields>
   <field name="catchall" type="genText" indexed="true" stored="false" multiValued="true" termVectors="true"/>
   <field name="doc_date" type="tdate" indexed="true" stored="true"/>
   <field name="pd" type="tdate" indexed="true" stored="true"/>
   <field name="ft_text" type="ignored"/>
   <field name="mime_type" type="mimeText" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
   <field name="ft_dname" type="genText" indexed="true" stored="true"/>
   <field name="ft_subject" type="genText" indexed="true" stored="true"/>
   <field name="action" type="keyText" indexed="true" stored="true"/>
   <field name="attribute" type="keyText" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
   <field name="category" type="keyText" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
   <field name="caption_writer" type="keyText" indexed="true" stored="true"/>
   <field name="doc_id" type="keyText" indexed="true" stored="true"/>
   <field name="ft_owner" type="keyText" indexed="true" stored="true"/>
   <field name="location" type="keyText" indexed="true" stored="true"/>
   <field name="special" type="keyText" indexed="true" stored="true"/>
   <field name="special_cats" type="keyText" indexed="true" stored="true"/>
   <field name="selector" type="keyText" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
   <field name="scode" type="keyText" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
   <field name="byline" type="sourceText" indexed="true" stored="true"/>
   <field name="credit" type="sourceText" indexed="true" stored="false"/>
   <field name="keywords" type="sourceText" indexed="true" stored="true"/>
   <field name="source" type="sourceText" indexed="true" stored="true"/>
   <field name="sg" type="lcsemi" indexed="true" stored="false" omitTermFreqAndPositions="true"/>
   <field name="aimcode" type="lowercase" indexed="true" stored="false" omitTermFreqAndPositions="true"/>
   <field name="nc_lang" type="lowercase" indexed="true" stored="false" omitTermFreqAndPositions="true"/>
   <field name="tag_id" type="lowercase" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
   <field name="collection" type="lowercase" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
   <field name="feature" type="lowercase" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
   <field name="ip" type="lowercase" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
   <field name="longdim" type="lowercase" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
   <field name="webtable" type="lowercase" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
   <field name="set_name" type="lowercase" indexed="true" stored="true" omitTermFreqAndPositions="true"/>
   <field name="did" type="long" indexed="true" stored="true" postingsFormat="BloomFilter"/>
   <field name="doc_size" type="long" indexed="true" stored="true"/>
   <field name="post_date" type="tlong" indexed="true" stored="true"/>
   <field name="post_hour" type="tlong" indexed="true" stored="true"/>
   <field name="set_count" type="int" indexed="false" stored="true"/>
   <field name="set_lead" type="boolean" indexed="true" stored="true" default="true"/>
   <field name="format" type="string" indexed="false" stored="true"/>
   <field name="ft_sfname" type="string" indexed="false" stored="true"/>
   <field name="text_preview" type="string" indexed="false" stored="true"/>
   <field name="_version_" type="long" indexed="true" stored="true"/>
   <field name="headline" type="keyText" indexed="true" stored="true"/>
   <field name="mood" type="keyText" indexed="true" stored="true"/>
   <field name="object" type="keyText" indexed="true" stored="true"/>
   <field name="personality" type="keyText" indexed="true" stored="true"/>
   <field name="poster" type="keyText" indexed="true" stored="true"/>
  </fields>
{code}

[jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog

2012-10-16 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477289#comment-13477289
 ] 

Shawn Heisey commented on SOLR-3954:


You'll notice that one field has postingsFormat.  This was for another bug that 
I filed.  It's not causing any difference in the config.  I will set up my 
import again so I can illustrate the performance impact from updateLog.





[jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog

2012-10-16 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477293#comment-13477293
 ] 

Shawn Heisey commented on SOLR-3954:


This is my most intense fieldType definition:

{code}
<fieldType name="genText" class="solr.TextField" sortMissingLast="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
      pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
      replacement="$2"
      allowempty="false"
    />
    <filter class="solr.WordDelimiterFilterFactory"
      splitOnCaseChange="1"
      splitOnNumerics="1"
      stemEnglishPossessive="1"
      generateWordParts="1"
      generateNumberParts="1"
      catenateWords="1"
      catenateNumbers="1"
      catenateAll="0"
      preserveOriginal="1"
    />
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.LengthFilterFactory" min="1" max="512"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
      pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
      replacement="$2"
      allowempty="false"
    />
    <filter class="solr.WordDelimiterFilterFactory"
      splitOnCaseChange="1"
      splitOnNumerics="1"
      stemEnglishPossessive="1"
      generateWordParts="1"
      generateNumberParts="1"
      catenateWords="0"
      catenateNumbers="0"
      catenateAll="0"
      preserveOriginal="1"
    />
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.LengthFilterFactory" min="1" max="512"/>
  </analyzer>
</fieldType>
{code}





[jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog

2012-10-16 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477326#comment-13477326
 ] 

Shawn Heisey commented on SOLR-3954:


A completed import with updateLog turned off:

{code}
<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
</lst>
<lst name="initArgs">
  <lst name="defaults">
    <str name="config">dih-config.xml</str>
  </lst>
</lst>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
  <str name="Total Requests made to DataSource">1</str>
  <str name="Total Rows Fetched">12947488</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2012-10-16 07:46:01</str>
  <str name="">Indexing completed. Added/Updated: 12947488 documents. Deleted 0 documents.</str>
  <str name="Committed">2012-10-16 11:17:48</str>
  <str name="Total Documents Processed">12947488</str>
  <str name="Time taken">3:31:47.508</str>
</lst>
<str name="WARNING">This response format is experimental.  It is likely to change in the future.</str>
</response>
{code}
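
For anyone who wants to turn a status response like the one above into a 
throughput figure, here is a minimal sketch in Python.  It is not part of the 
original comment.  The embedded XML is a trimmed copy of the response (only 
three statusMessages entries are kept), and the helper names parse_duration 
and status_messages are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the DIH status response quoted above; only the
# statusMessages entries needed for the calculation are kept.
STATUS_XML = """<response>
<lst name="statusMessages">
  <str name="Total Rows Fetched">12947488</str>
  <str name="Total Documents Processed">12947488</str>
  <str name="Time taken">3:31:47.508</str>
</lst>
</response>
"""


def parse_duration(text):
    """Convert an 'H:M:S.mmm' string such as '3:31:47.508' to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def status_messages(xml_text):
    """Return the statusMessages entries as a name -> text dictionary."""
    root = ET.fromstring(xml_text)
    messages = root.find("./lst[@name='statusMessages']")
    return {entry.get("name"): entry.text for entry in messages}


messages = status_messages(STATUS_XML)
docs = int(messages["Total Documents Processed"])
seconds = parse_duration(messages["Time taken"])
docs_per_second = docs / seconds

print(f"{docs} documents in {seconds:.3f} s = {docs_per_second:.1f} docs/s")
```

This gives the average rate for one import with updateLog turned off; it says 
nothing about how the rate varied over the run.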





[jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog

2012-10-16 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477333#comment-13477333
 ] 

David Smiley commented on SOLR-3954:


FWIW I've seen the updateLog grow to huge sizes during my bulk import.  I 
commit at the end (of course), with no soft commits or auto commits in 
between.  The updateLog is a hindrance during bulk imports.




[jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog

2012-10-16 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477445#comment-13477445
 ] 

Shawn Heisey commented on SOLR-3954:


Here's a direct comparison on the same hardware.  It might be important to know 
that when my import gets kicked off, there are actually four imports running.  
One of them is small -- during the second test (updateLog off), it imported 
687765 rows in 10 minutes and 08 seconds.  I did not check how long it took 
during the first test.  The other three imports are all nearly 13 million 
records each.

A du on the completed index directory with 12.9 million records shows 23520900 
KB.

I ran the first test and grabbed stats after an hour.  Then I killed Solr, 
commented out updateLog, started it up again, kicked off the full-import, and 
again grabbed stats after an hour.  Comparing the two shows that indexing is 
about twice as fast with updateLog turned off: roughly 4.17 million documents 
processed in the first hour, versus 2.05 million with it turned on.

With updateLog turned on:

{code}
<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
</lst>
<lst name="initArgs">
  <lst name="defaults">
    <str name="config">dih-config.xml</str>
  </lst>
</lst>
<str name="status">busy</str>
<str name="importResponse">A command is still running...</str>
<lst name="statusMessages">
  <str name="Time Elapsed">1:0:1.762</str>
  <str name="Total Requests made to DataSource">1</str>
  <str name="Total Rows Fetched">2052096</str>
  <str name="Total Documents Processed">2052095</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2012-10-16 14:59:01</str>
</lst>
<str name="WARNING">This response format is experimental.  It is likely to change in the future.</str>
</response>
{code}

With updateLog turned off:

{code}
<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
</lst>
<lst name="initArgs">
  <lst name="defaults">
    <str name="config">dih-config.xml</str>
  </lst>
</lst>
<str name="status">busy</str>
<str name="importResponse">A command is still running...</str>
<lst name="statusMessages">
  <str name="Time Elapsed">1:0:0.434</str>
  <str name="Total Requests made to DataSource">1</str>
  <str name="Total Rows Fetched">4167525</str>
  <str name="Total Documents Processed">4167524</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2012-10-16 16:05:01</str>
</lst>
<str name="WARNING">This response format is experimental.  It is likely to change in the future.</str>
</response>
{code}
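
The two one-hour snapshots above can be compared directly.  The following is 
a small Python sketch, not part of the original comment, that takes the 
"Total Documents Processed" and "Time Elapsed" values from the two responses 
and works out the per-import indexing rate and the ratio between them.  The 
function and variable names are illustrative only.  Note that each figure is 
for a single import while several imports were running at the same time, so 
these are not whole-machine rates.

```python
def elapsed_seconds(text):
    """Convert a DIH 'Time Elapsed' value such as '1:0:1.762' to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def docs_per_second(sample):
    """Average indexing rate for a (documents, elapsed time) pair."""
    docs, elapsed = sample
    return docs / elapsed_seconds(elapsed)


# (Total Documents Processed, Time Elapsed) copied from the two responses.
with_update_log = (2052095, "1:0:1.762")
without_update_log = (4167524, "1:0:0.434")

rate_on = docs_per_second(with_update_log)
rate_off = docs_per_second(without_update_log)
speedup = rate_off / rate_on

print(f"updateLog on:  {rate_on:7.1f} docs/s")
print(f"updateLog off: {rate_off:7.1f} docs/s")
print(f"speedup with updateLog off: {speedup:.2f}x")
```

The ratio comes out slightly above 2, which matches the "about twice as fast" 
observation for the first hour.  It is a single pair of measurements, so it 
should be read as an indication rather than a benchmark.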





[jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog

2012-10-16 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477462#comment-13477462
 ] 

Shawn Heisey commented on SOLR-3954:


bq. In any case, I don't think we would add an option to skip the update log - 
you can remove it if the performance is unacceptable.

When I revamp my SolrJ application, I plan to use soft commit on a very short 
interval (maybe 10 seconds) but only do a hard commit every five minutes, 
possibly even less often.

If I understand the updateLog functionality right, and I don't claim that I do, 
it would mean that my SolrJ code would not need to keep separate track of which 
updates succeeded with soft commit and which ones succeeded with hard commit.  
If the server went down four minutes and 55 seconds after the last hard commit, 
I would have reasonable expectation that when it came back up, all those soft 
commits would get properly applied to my index.

Assuming I have a proper understanding above, I want the updateLog for my 
incremental updates.  It makes the bulk import take at least twice as long, and 
I do not need it there because if that fails, I will just start it over.  If I 
am going to benefit from updateLog, I need to be able to turn it off for bulk 
indexing.

Is there a way to create a second updateHandler that does not have updateLog 
enabled and tell DIH to use that handler?

