[jira] Commented: (LUCENE-1145) DisjunctionSumScorer small tweak

2008-02-08 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566961#action_12566961
 ] 

Eks Dev commented on LUCENE-1145:
-

A test using the Sun 1.4 JVM on the same hardware showed the same "a bit faster" 
behavior, so in my opinion this is OK to be committed.

> DisjunctionSumScorer small tweak
> 
>
> Key: LUCENE-1145
> URL: https://issues.apache.org/jira/browse/LUCENE-1145
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
> Environment: all
>Reporter: Eks Dev
>Priority: Trivial
> Attachments: DisjunctionSumScorerOptimization.patch, 
> DSSQueueSizeOptimization.patch, TestScorerPerformance.java
>
>
> Move ScorerDocQueue initialization from the next() and skipTo() methods to the 
> constructor. This makes DisjunctionSumScorer a bit faster (less than 1% in my 
> tests). 
> The downside (if it is one, I cannot judge) is that the DisjunctionSumScorer 
> constructors now throw IOException, since we touch the hard disk there. I see no 
> problem, as this IOException does not propagate too far (the only modification 
> I made is in BooleanScorer2):
> if (scorerDocQueue == null) {
>   initScorerDocQueue();
> }
>  
> The attached test is just a quick & dirty rip of TestScorerPerf from the standard 
> Lucene test package. It is not included as a patch, as I do not like it.
> All tests pass; the patch was made against trunk revision 613923.
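
As a hedged illustration of the change described above (the class below uses
invented stand-in names, not the actual DisjunctionSumScorer source): the
disk-touching queue construction moves out of the null-guarded next()/skipTo()
path and into the constructor, which is why the constructor must now declare
IOException.

import java.io.IOException;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of the pattern in the patch, not the real scorer:
// disk-touching initialization moves from the first next()/skipTo() call
// into the constructor, so the constructor now declares IOException and
// callers (BooleanScorer2 in the real code) must handle it.
class EagerInitScorerSketch {
  private final PriorityQueue<String> scorerDocQueue;  // stand-in for ScorerDocQueue

  EagerInitScorerSketch(List<String> subScorers) throws IOException {
    // Previously guarded by "if (scorerDocQueue == null)" inside next()/skipTo().
    this.scorerDocQueue = new PriorityQueue<String>(subScorers);
  }

  boolean next() {
    // No lazy-init check remains on the hot path.
    return scorerDocQueue.poll() != null;
  }
}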




Re: detected corrupted index / performance improvement

2008-02-08 Thread Michael McCandless


Mike, you're right: all Lucene files are written sequentially
(whether flushing or merging).

It's just a matter of how many are open at once, and whether we are
also reading from source files, which affects IO throughput far
less than truly random-access writes.

Plus, as of LUCENE-843, bytes are written to tvx/tvd/tvf and fdx/fdt
"as we go", which is better because we get the bytes to the OS earlier
so it can properly schedule their arrival to stable storage.  So by
the time we flush a segment, the OS should have committed most of
those bytes.

When writing a segment, we write fnm, then open tii/tis/frq/prx at
once and write (sequentially) to them, then write to nrm.

Merging is far more IO intensive.  With mergeFactor=10, we read from
40 input streams and write to 4 output streams when merging the
tii/tis/frq/prx files.

Mike

Mike Klaas wrote:

Oh, it certainly causes some random access--I don't deny that.  I 
just want to emphasize that this isn't at all the same as purely 
"random writes", which would be expected to perform an order of 
magnitude slower.


I just did a test where I wrote out a 1 GB file in 1K chunks, then 
wrote it out to 2 files alternating 512-byte chunks, then to 4 files 
with 256-byte chunks.  Some speed is lost--perhaps 10% at each 
doubling--but the speed is still essentially "sequential" speed.  You 
can get back the original performance by using consistently sized 
chunks (1K to each file, round-robin).
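
A rough reconstruction of that experiment (file names, buffer handling, and
the timing are my assumptions, not Mike's actual test code): write the same
total volume round-robin across N files in 1K/N chunks and report the
effective throughput.

import java.io.FileOutputStream;
import java.io.IOException;

// Rough reconstruction of the experiment described above; names and sizes
// are assumptions.  With no sync() call, this mostly measures how well the
// OS batches the interleaved writes.
public class RoundRobinWriteTest {
  public static void main(String[] args) throws IOException {
    final int numFiles = args.length > 0 ? Integer.parseInt(args[0]) : 4;
    final int chunk = 1024 / numFiles;          // 1 file -> 1K, 2 -> 512, 4 -> 256
    final long totalBytes = 1L << 30;           // 1 GB in total
    final byte[] buf = new byte[chunk];

    FileOutputStream[] outs = new FileOutputStream[numFiles];
    for (int i = 0; i < numFiles; i++) {
      outs[i] = new FileOutputStream("chunktest." + i);
    }

    long written = 0;
    long start = System.currentTimeMillis();
    while (written < totalBytes) {
      for (int i = 0; i < numFiles && written < totalBytes; i++) {
        outs[i].write(buf);                     // alternate small writes across files
        written += chunk;
      }
    }
    for (int i = 0; i < numFiles; i++) {
      outs[i].close();
    }
    double seconds = (System.currentTimeMillis() - start) / 1000.0;
    System.out.println(numFiles + " files, " + chunk + "-byte chunks: "
        + (totalBytes / (1024.0 * 1024.0)) / seconds + " MB/s");
  }
}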


HDD controllers are actually quite good at batching writes into 
sequential order.  Why else do you think sync() takes so long :)


-Mike

On 7-Feb-08, at 3:35 PM, robert engels wrote:


I don't think that is true - but I'm probably wrong though :).

My understanding is that several files are written in parallel 
(during the merge), causing random access. After the files are 
written, they are all reread and written as a CFS file 
(essentially sequential - although the interleaved read and write is 
going to cause head movement).


The code:

private IndexOutput tvx, tvf, tvd;  // To write term vectors

private FieldsWriter fieldsWriter;

is my clue that several files are written at once.

On Feb 7, 2008, at 5:19 PM, Mike Klaas wrote:



On 7-Feb-08, at 2:00 PM, robert engels wrote:

My point is that commit needs to be used in most applications,  
and the commit in Lucene is very slow.


You don't have 2x the IO cost, mainly because only the log file 
needs to be sync'd.  The index only has to be sync'd eventually, 
in order to prune the logfile - this can be done in the 
background, improving the performance of the update and commit cycle.


Also, writing the log file is very efficient because it is an 
append/sequential operation. Writing the segment files writes 
multiple files - essentially causing random-access writes.


For large segments, multiple sequentially-written large files  
should perform similarly to one large sequentially-written file.   
It is only close to random access on the smallest segments (which  
a sufficiently-large flush-by-ram shouldn't produce).


-Mike


 



[jira] Commented: (LUCENE-1169) Search with Filter does not work!

2008-02-08 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566971#action_12566971
 ] 

Eks Dev commented on LUCENE-1169:
-

Thank you for fixing it in no time :) But...

I am getting confused by the skipping-iterator semantics:

is this a requirement for the other DocIdSetIterators, or only for scorers 
(it should be, I guess)?

iterator.skipTo(iterator.doc()) <=> iterator.next(); // is this contract?

If that is the case, we have another bug in OpenBitSetIterator (a border 
condition):

// this is the code in the javadoc, the "official contract"
boolean simulatedSkipTo(DocIdSetIterator i, int target) throws IOException {
  do {
    if (!i.next())
      return false;
  } while (target > i.doc());
  return true;
}

public void testOpenBitSetBorderCondition() throws IOException {
  OpenBitSet bs = new OpenBitSet();
  bs.set(0);
  DocIdSetIterator i = bs.iterator();

  i.skipTo(i.doc());
  assertEquals(0, i.doc()); // cool, moved to the first legal position

  assertFalse("End of Matcher", i.skipTo(i.doc())); // NOT OK according to the javadoc
}

public void testOpenBitSetBorderConditionSimulated() throws IOException {
  OpenBitSet bs = new OpenBitSet();
  bs.set(0);
  DocIdSetIterator i = bs.iterator();

  simulatedSkipTo(i, i.doc());
  assertEquals(0, i.doc()); // cool, moved to the first legal position

  assertFalse("End of Matcher", simulatedSkipTo(i, i.doc())); // OK according to the javadoc!!
}


> Search with Filter does not work!
> -
>
> Key: LUCENE-1169
> URL: https://issues.apache.org/jira/browse/LUCENE-1169
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Reporter: Eks Dev
>Assignee: Michael Busch
>Priority: Blocker
> Attachments: lucene-1169.patch, TestFilteredSearch.java
>
>
> See attached JUnitTest, self-explanatory




[jira] Resolved: (LUCENE-1164) Improve how ConcurrentMergeScheduler handles too-many-merges case

2008-02-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1164.


Resolution: Fixed

> Improve how ConcurrentMergeScheduler handles too-many-merges case
> -
>
> Key: LUCENE-1164
> URL: https://issues.apache.org/jira/browse/LUCENE-1164
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1164.patch
>
>
> CMS now lets you set "maxMergeThreads" to control the max number of
> simultaneous merges.
> However, when CMS hits that max, it still allows further merges to
> run, by running them in the foreground thread.  So if you set this max
> to 1, and use 1 thread to add docs, you can get 2 merges running at
> once (which I think is broken).
> I think, instead, CMS should pause the foreground thread, waiting
> until the number of merge threads drops below the limit, then kick
> off the backlog merge in a thread and return control to the primary
> thread.
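
A sketch of the pausing behaviour proposed above (field and method names are
invented, not the actual ConcurrentMergeScheduler code): the thread that found
a backlog merge waits until the number of running merge threads drops below
the limit, then launches the merge in its own thread.

// Invented names; a sketch of the proposed behaviour, not the actual
// ConcurrentMergeScheduler implementation.
class MergeThrottle {
  private final int maxMergeThreads;
  private int runningMergeThreads;

  MergeThrottle(int maxMergeThreads) {
    this.maxMergeThreads = maxMergeThreads;
  }

  // Called by the "foreground" thread that wants to start a backlog merge.
  synchronized void acquireMergeSlot() throws InterruptedException {
    while (runningMergeThreads >= maxMergeThreads) {
      wait();                    // pause the caller instead of merging inline
    }
    runningMergeThreads++;       // slot taken; caller now spawns a merge thread
  }

  // Called by a merge thread when its merge finishes.
  synchronized void releaseMergeSlot() {
    runningMergeThreads--;
    notifyAll();                 // wake any paused foreground thread
  }
}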




Re: detected corrupted index / performance improvement

2008-02-08 Thread Doug Cutting

Michael McCandless wrote:

Merging is far more IO intensive.  With mergeFactor=10, we read from
40 input streams and write to 4 output streams when merging the
tii/tis/frq/prx files.


If your disk can transfer at 50MB/s, and takes 5ms/seek, then 250kB 
reads and writes are the break-even point, where half the time is spent 
seeking and half transferring, and throughput is 25MB/s.  With 44 files 
open, that means the OS needs just 11MB of buffering to keep things 
above this threshold.  Since most systems have considerably larger 
buffer pools than 11MB, merging with mergeFactor=10 shouldn't be seek-bound.
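
The arithmetic behind those figures, as a small sketch (50 MB/s and 5 ms are
Doug's example numbers): the break-even chunk is the amount you can transfer
in one seek time, and 44 open files each need one such chunk buffered.

// Worked version of the example numbers above.
public class SeekBreakEven {
  public static void main(String[] args) {
    double bytesPerSec = 50e6;                      // 50 MB/s sequential transfer
    double seekSec = 0.005;                         // 5 ms per seek
    double breakEvenBytes = bytesPerSec * seekSec;  // 250,000 bytes = 250 kB
    int openFiles = 40 + 4;                         // inputs + outputs at mergeFactor=10
    System.out.println("break-even chunk: " + breakEvenBytes / 1e3 + " kB");
    System.out.println("effective rate:   " + bytesPerSec / 2 / 1e6 + " MB/s");
    System.out.println("buffering needed: " + openFiles * breakEvenBytes / 1e6 + " MB");
  }
}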


Doug





[jira] Created: (LUCENE-1170) query with AND and OR not retrieving correct results

2008-02-08 Thread Graham Maloon (JIRA)
query with AND and OR not retrieving correct results


 Key: LUCENE-1170
 URL: https://issues.apache.org/jira/browse/LUCENE-1170
 Project: Lucene - Java
  Issue Type: Bug
  Components: QueryParser
Affects Versions: 2.3
 Environment: linux and windows
Reporter: Graham Maloon


I was working with Lucene 1.4, and have now upgraded to 2.3.0, but there is 
still a problem that I am experiencing with the QueryParser.
 
I am passing the following queries:
 
"big brother" - works fine
"big brother" AND dubai - works fine
"big brother" AND football - works fine
"big brother" AND dubai OR football - returns extra documents which contain 
"big brother" but do not contain either dubai or football.
"big brother" AND (dubai OR football) gives the same as the one above  
 
Am I doing something wrong?





[jira] Commented: (LUCENE-1170) query with AND and OR not retrieving correct results

2008-02-08 Thread Daniel Naber (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567154#action_12567154
 ] 

Daniel Naber commented on LUCENE-1170:
--

It's a known problem with QueryParser; see e.g. LUCENE-167.
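
One way to see what the parser actually built (a sketch against the 2.3
QueryParser API; the "contents" field name and the analyzer choice are
placeholders): print the two parsed queries and compare their clause
structure.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// Sketch for diagnosing the AND/OR grouping; "contents" is a placeholder field.
public class ShowParsedQuery {
  public static void main(String[] args) throws ParseException {
    QueryParser qp = new QueryParser("contents", new StandardAnalyzer());
    Query q1 = qp.parse("\"big brother\" AND dubai OR football");
    Query q2 = qp.parse("\"big brother\" AND (dubai OR football)");
    // Printing the parsed queries shows how the clauses were actually
    // grouped, which is the usual first step with LUCENE-167-style surprises.
    System.out.println(q1);
    System.out.println(q2);
  }
}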

> query with AND and OR not retrieving correct results
> 
>
> Key: LUCENE-1170
> URL: https://issues.apache.org/jira/browse/LUCENE-1170
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: QueryParser
>Affects Versions: 2.3
> Environment: linux and windows
>Reporter: Graham Maloon
>
> I was working with Lucene 1.4, and have now upgraded to 2.3.0 but there is 
> still a problem that I am experiencing with the Queryparser
>  
> I am passing the following queries:
>  
> "big brother" - works fine
> "big brother" AND dubai - works fine
> "big brother" AND football - works fine
> "big brother" AND dubai OR football - returns extra documents which contain 
> "big brother" but do not contain either dubai or football.
> "big brother" AND (dubai OR football) gives the same as the one above  
>  
> Am I doing something wrong?




Re: detected corrupted index / performance improvement

2008-02-08 Thread Doug Cutting

robert engels wrote:
But that would mean we should be using at least 250k buffers for the 
IndexInput, not the 16k or so that is the default?


Is the OS smart enough to figure out that the file is being read 
sequentially, and to adjust its physical read size to 256k, based on the 
other concurrent IO operations? It seems this would be hard for it to 
figure out without performing poorly in the general case.


Benchmarks have shown that OSes do a decent job at this.  You can 
increase the application's buffer sizes, but you might just end up 
wasting memory if the OS is already doing the right thing.  The Linux 
kernel dynamically increases the readahead window based on the access 
pattern: the more you read sequentially, the larger the readahead window.


Doug




[jira] Updated: (LUCENE-1044) Behavior on hard power shutdown

2008-02-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1044:
---

Attachment: LUCENE-1044.take8.patch

Attached a new rev of the patch.  The only changes were to add caveats in the 
javadocs about IO devices that ignore fsync, and to update the patch to apply 
cleanly on current trunk.

I plan to commit in a day or two.
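
For reference, a minimal illustration of what the fsync in question means at
the Java level (not the LUCENE-1044 patch itself; the file name is a
placeholder): force the written bytes out to the device before trusting they
will survive a power loss. Devices that ignore the flush are exactly the
caveat the javadocs now mention.

import java.io.IOException;
import java.io.RandomAccessFile;

// Minimal fsync illustration, not the LUCENE-1044 patch itself.
public class SyncExample {
  public static void main(String[] args) throws IOException {
    RandomAccessFile f = new RandomAccessFile("segments.tmp", "rw");
    f.writeBytes("some index metadata");
    // Ask the OS to push the file's data to stable storage.  A device that
    // ignores fsync can still lose these bytes on a hard power-off.
    f.getFD().sync();
    f.close();
  }
}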

> Behavior on hard power shutdown
> ---
>
> Key: LUCENE-1044
> URL: https://issues.apache.org/jira/browse/LUCENE-1044
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
> Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java 
> 1.5
>Reporter: venkat rangan
>Assignee: Michael McCandless
> Fix For: 2.4
>
> Attachments: FSyncPerfTest.java, LUCENE-1044.patch, 
> LUCENE-1044.take2.patch, LUCENE-1044.take3.patch, LUCENE-1044.take4.patch, 
> LUCENE-1044.take5.patch, LUCENE-1044.take6.patch, LUCENE-1044.take7.patch, 
> LUCENE-1044.take8.patch
>
>
> When indexing a large number of documents, upon a hard power failure (e.g. 
> pulling the power cord), the index seems to get corrupted. We start a Java 
> application as a Windows Service and feed it documents. In some cases 
> (after an index size of 1.7GB, with 30-40 index segment .cfs files), the 
> following is observed.
> The 'segments' file contains only zeros. Its size is 265 bytes - all bytes 
> are zeros.
> The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes 
> are zeros.
> Before corruption, the segments file and deleted file appear to be correct. 
> After this corruption, the index is corrupted and lost.
> This is a problem observed in Lucene 1.4.3. We are not able to upgrade our 
> customer deployments to 1.9 or a later version, but would be happy to back-port 
> a patch, if the patch is small enough and if this problem has already been solved.




[jira] Commented: (LUCENE-1169) Search with Filter does not work!

2008-02-08 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567115#action_12567115
 ] 

Doug Cutting commented on LUCENE-1169:
--

> iterator.skipTo(iterator.doc()) <=> iterator.next();// is this contract?

Yes.  The reason is that TermDocs#doc() cannot be called when a TermDocs is 
first created, since it is then positioned before the first entry.  One must 
call next() at least once before first calling doc().  If the TermDocs is 
empty, then doc() should never be called.  Consider the case of passing an 
empty TermDocs to skipTo(int): the call to next must be made, so that 'false' 
is returned without ever calling doc().

There are other ways of doing this, like defining that doc() returns -1 before 
next() has ever been called, and/or returning Integer.MAX_VALUE after the last 
document.  But, for better or worse, that's not the design that was chosen.
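
In code, the contract Doug describes amounts to the following usage pattern
(a sketch, not library code; the same rule applies to TermDocs, whose
next()/doc()/skipTo() methods mirror these):

import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

// Sketch of the usage contract: never call doc() before a successful
// next()/skipTo(), and stop once either returns false.
class IteratorContractSketch {
  static void consume(DocIdSetIterator iter) throws IOException {
    while (iter.next()) {       // positions the iterator on a real document
      int d = iter.doc();       // only now is doc() defined
      System.out.println(d);    // ... process document d ...
    }
    // After next() returns false, doc() and skipTo(doc()) are undefined.
  }
}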


> Search with Filter does not work!
> -
>
> Key: LUCENE-1169
> URL: https://issues.apache.org/jira/browse/LUCENE-1169
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Reporter: Eks Dev
>Assignee: Michael Busch
>Priority: Blocker
> Attachments: lucene-1169.patch, TestFilteredSearch.java
>
>
> See attached JUnitTest, self-explanatory




[jira] Created: (LUCENE-1171) Make DocumentsWriter more robust on hitting OOM

2008-02-08 Thread Michael McCandless (JIRA)
Make DocumentsWriter more robust on hitting OOM
---

 Key: LUCENE-1171
 URL: https://issues.apache.org/jira/browse/LUCENE-1171
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.3
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.4


I've been stress testing DocumentsWriter by indexing Wikipedia while not
giving enough memory to the JVM, with varying heap sizes to tickle the
different interesting cases.  Sometimes DocumentsWriter can deadlock;
other times it will hit a subsequent NPE, AIOOBE, or assertion
failure.

I've fixed all the cases I've found and added some more asserts.  Now
it just produces plain OOM exceptions.  All changes are confined to
DocumentsWriter.java.

All tests pass.  I plan to commit in a day or two!




[jira] Updated: (LUCENE-1171) Make DocumentsWriter more robust on hitting OOM

2008-02-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1171:
---

Attachment: LUCENE-1171.patch

Attached patch.

> Make DocumentsWriter more robust on hitting OOM
> ---
>
> Key: LUCENE-1171
> URL: https://issues.apache.org/jira/browse/LUCENE-1171
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1171.patch
>
>
> I've been stress testing DocumentsWriter by indexing wikipedia, but not
> giving enough memory to the JVM, in varying heap sizes to tickle the
> different interesting cases.  Sometimes DocumentsWriter can deadlock;
> other times it will hit a subsequent NPE or AIOOBE or assertion
> failure.
> I've fixed all the cases I've found, and added some more asserts.  Now
> it just produces plain OOM exceptions.  All changes are contained to
> DocumentsWriter.java.
> All tests pass.  I plan to commit in a day or two!




Re: detected corrupted index / performance improvement

2008-02-08 Thread Doug Cutting

Doug Cutting wrote:
The linux 
kernel dynamically increases the readahead window based on the access 
pattern: the more you read sequentially, the larger the readahead window.


Sorry, it appears that's in 2.6.23, which isn't yet broadly used.

http://kernelnewbies.org/Linux_2_6_23#head-102af265937262a7a21766ae58fddc1a29a5d8d7

In the meantime, on Linux, one can set both the kernel's readahead 
buffer size and the device's.  These are additive: the first determines 
what requests will be made to the device, the second determines how much 
beyond that the device will attempt to read.


# set kernel read-ahead buffer to 1MB
echo 1024 > /sys/block/sda/queue/read_ahead_kb

# set device read-ahead buffer to 1024 sectors
hdparm -a1024 /dev/sda1

I don't know how much these actually help things...

Doug




Re: detected corrupted index / performance improvement

2008-02-08 Thread robert engels
But that would mean we should be using at least 250k buffers for the 
IndexInput, not the 16k or so that is the default?


Is the OS smart enough to figure out that the file is being read 
sequentially, and to adjust its physical read size to 256k, based on the 
other concurrent IO operations? It seems this would be hard for it to 
figure out without performing poorly in the general case.


On Feb 8, 2008, at 11:25 AM, Doug Cutting wrote:


Michael McCandless wrote:

Merging is far more IO intensive.  With mergeFactor=10, we read from
40 input streams and write to 4 output streams when merging the
tii/tis/frq/prx files.


If your disk can transfer at 50MB/s, and takes 5ms/seek, then 250kB  
reads and writes are the break-even point, where half the time is  
spent seeking and half transferring, and throughput is 25MB/s.   
With 44 files open, that means the OS needs just 11MB of buffering  
to keep things above this threshold.  Since most systems have  
considerably larger buffer pools than 11MB, merging with  
mergeFactor=10 shouldn't be seek-bound.


Doug





[jira] Commented: (LUCENE-1157) Formatable changes log (CHANGES.txt is easy to edit but not so friendly to read by Lucene users)

2008-02-08 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567236#action_12567236
 ] 

Steven Rowe commented on LUCENE-1157:
-

Excellent, the link from the Developer Resources page now works!

Doron, I noticed that when you initially committed this, you added an entry to 
CHANGES.txt, in the "Documentation" section, but it is no longer there.

> Formatable changes log  (CHANGES.txt is easy to edit but not so friendly to 
> read by Lucene users)
> -
>
> Key: LUCENE-1157
> URL: https://issues.apache.org/jira/browse/LUCENE-1157
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Website
>Reporter: Doron Cohen
>Assignee: Doron Cohen
> Fix For: 2.4
>
> Attachments: lucene-1157-take2.patch, lucene-1157-take3.patch, 
> lucene-1157.patch
>
>
> Background in http://www.nabble.com/formatable-changes-log-tt15078749.html




[jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter

2008-02-08 Thread Michael McCandless (JIRA)
Small speedups to DocumentsWriter
-

 Key: LUCENE-1172
 URL: https://issues.apache.org/jira/browse/LUCENE-1172
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.3
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.4
 Attachments: LUCENE-1172.patch

Some small fixes that I found while profiling indexing Wikipedia,
mainly using our own quickSort instead of Arrays.sort.

Testing first 200K docs of Wikipedia shows a speedup from 274.6
seconds to 270.2 seconds.

I'll commit in a day or two.
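
For context, a simplified sketch of the kind of hand-rolled quicksort being
referred to (not the actual DocumentsWriter code, which presumably sorts its
own posting structures in place):

// Simplified hand-rolled quicksort sketch, not the DocumentsWriter code.
final class QuickSortSketch {
  static void quickSort(String[] a, int lo, int hi) {
    if (lo >= hi) return;
    String pivot = a[lo + (hi - lo) / 2];
    int i = lo, j = hi;
    while (i <= j) {
      while (a[i].compareTo(pivot) < 0) i++;
      while (a[j].compareTo(pivot) > 0) j--;
      if (i <= j) {
        String tmp = a[i]; a[i] = a[j]; a[j] = tmp;
        i++; j--;
      }
    }
    quickSort(a, lo, j);   // recurse into the left partition
    quickSort(a, i, hi);   // recurse into the right partition
  }

  public static void main(String[] args) {
    String[] terms = { "wikipedia", "profiling", "lucene", "index" };
    quickSort(terms, 0, terms.length - 1);
    System.out.println(java.util.Arrays.asList(terms));
  }
}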




[jira] Updated: (LUCENE-1172) Small speedups to DocumentsWriter

2008-02-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1172:
---

Attachment: LUCENE-1172.patch

> Small speedups to DocumentsWriter
> -
>
> Key: LUCENE-1172
> URL: https://issues.apache.org/jira/browse/LUCENE-1172
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1172.patch
>
>
> Some small fixes that I found while profiling indexing Wikipedia,
> mainly using our own quickSort instead of Arrays.sort.
> Testing first 200K docs of Wikipedia shows a speedup from 274.6
> seconds to 270.2 seconds.
> I'll commit in a day or two.




Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter

2008-02-08 Thread robert engels
Curious... on things like this, is it really worth adding (and 
maintaining) Lucene's own sort just to achieve a 1.5% performance 
increase? It is doubtful that you can even measure an improvement at 
that level, given all of the variables you can't control.


I see a LOT of code in Lucene that is very obtuse - mainly to gain 
VERY small performance benefits.


Isn't there a compelling case for not worrying about this stuff, letting 
the JVM people figure it out, and concentrating on writing clear, 
easy-to-understand code?


I think we are better off looking for data structure or algorithm 
changes - these micro-improvements just lead to code bloat and 
maintenance headaches. I also think it is likely that future JVM 
generations will do these optimizations automatically anyway, and any 
hand optimizing might actually reduce performance.



On Feb 8, 2008, at 6:52 PM, Michael McCandless (JIRA) wrote:


Small speedups to DocumentsWriter
-

 Key: LUCENE-1172
  URL: https://issues.apache.org/jira/browse/LUCENE-1172

 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.3
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.4
 Attachments: LUCENE-1172.patch

Some small fixes that I found while profiling indexing Wikipedia,
mainly using our own quickSort instead of Arrays.sort.

Testing first 200K docs of Wikipedia shows a speedup from 274.6
seconds to 270.2 seconds.

I'll commit in a day or two.
