[jira] Updated: (LUCENE-2111) Wrapup flexible indexing

2010-02-09 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2111:
---

Attachment: LUCENE-2111.patch

New flex patch attached:

  * I factored out separate public Multi* (Fields, Terms, etc.) classes
from DirectoryReader and MultiReader.  These classes merge multiple
flex "sub readers" into a single flex API on the fly.

  * Refactored all places that need to merge sub-readers to use this
API (DirectoryReader, MultiReader, SegmentMerger).  This is
cleaner because previously SegmentMerger had its own duplicated
code for doing this merging; now we have a single source for it
(though merging swaps in its own docs/positions enum, to remap
docIDs around deletions).

  * Changed the semantics of IndexReader.fields() -- for a multi
reader (any reader that consists of sequential sub readers),
fields() now throws UnsupportedOperationException (UOE).

This is an important change with flex -- the caller now bears
responsibility for creating a MultiFields if they really need it.

My thinking is that the primary places in Lucene that consume postings
now operate per-segment, so a multi reader (Dir/MultiReader) should not
automatically "join up high", because that entails a hidden performance
hit.  So consumers that must access the flex API at the multi reader
level should be explicit about it...

However, to make this simple, I created static sugar methods on
MultiFields (eg, MultiFields.getFields(IndexReader)) to easily do
this, and cut over places in Lucene that may need direct postings from
a multi-reader to use this method.

I've updated the javadocs explaining this.
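
For example, a consumer that really wants the merged view would now do
something like this (rough sketch only -- MultiFields.getFields is from this
patch, but the other flex signatures here are from memory and may differ):

{code}
import java.io.IOException;

import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;

public class MergedPostingsExample {

  // Explicitly "join up high": the caller pays the merge cost, visibly.
  static long countTerms(IndexReader reader, String field) throws IOException {
    Fields fields = MultiFields.getFields(reader);
    if (fields == null) {
      return 0;                        // reader has no postings at all
    }
    Terms terms = fields.terms(field);
    if (terms == null) {
      return 0;                        // field was not indexed
    }
    TermsEnum termsEnum = terms.iterator();
    long count = 0;
    while (termsEnum.next() != null) {
      count++;
    }
    return count;
  }
}
{code}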


> Wrapup flexible indexing
> 
>
> Key: LUCENE-2111
> URL: https://issues.apache.org/jira/browse/LUCENE-2111
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: Flex Branch
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, 
> LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch
>
>
> Spinoff from LUCENE-1458.
> The flex branch is in fairly good shape -- all tests pass, initial search 
> performance testing looks good, it survived several visits from the Unicode 
> policeman ;)
> But it still has a number of nocommits, could use some more scrutiny 
> especially on the "emulate old API on flex index" and vice/versa code paths, 
> and still needs some more performance testing.  I'll do these under this 
> issue, and we should open separate issues for other self contained fixes.
> The end is in sight!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1769) Fix wrong clover analysis because of backwards-tests, upgrade clover to 2.6.3 or better

2010-02-09 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-1769.
---

Resolution: Fixed
  Assignee: Uwe Schindler

It seems to work, so closing this issue.

> Fix wrong clover analysis because of backwards-tests, upgrade clover to 2.6.3 
> or better
> ---
>
> Key: LUCENE-1769
> URL: https://issues.apache.org/jira/browse/LUCENE-1769
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Attachments: clover.license, clover.license, LUCENE-1769-2.patch, 
> LUCENE-1769.patch, LUCENE-1769.patch, LUCENE-1769.patch, LUCENE-1769.patch, 
> nicks-LUCENE-1769.patch
>
>
> This is a followup for 
> [http://www.lucidimagination.com/search/document/6248d6eafbe10ef4/build_failed_in_hudson_lucene_trunk_902]
> The problem with clover running on hudson is that it does not instrument all 
> tests that are run. The autodetection of clover 1.x is not able to find out 
> which files are the correct tests, and it only instruments the backwards 
> tests. Because of this, the current coverage report is only from the backwards 
> tests running against the current Lucene JAR.
> You can see this if you install clover and start the tests: during test-core 
> no clover data is added to the db; only when the backwards tests begin are new 
> files created in the clover db folder.
> Clover 2.x supports a new ant task element that can be used to specify the 
> files that are the tests. It works here locally with clover 2.4.3 and 
> produces a really nice coverage report; linking with test files also works, it 
> tells which tests failed, and so on.
> I will attach a patch that changes common-build.xml to the new clover 
> version (other initialization resource) and tells clover where to find the 
> tests (using the test folder include/exclude properties).
> One problem with the current patch: it does *not* instrument the backwards 
> branch, so you see only coverage of the core/contrib tests. Getting the 
> coverage from the backwards tests as well is not easily possible, because of 
> two things:
> - the tag test dir is not easy to find and add to the element 
> (there may be only one of them)
> - the test names in the BW branch are identical to the trunk tests. This 
> completely corrupts the linkage between tests and code in the coverage report.
> In principle the best approach would be to generate a second coverage report 
> for the backwards branch with a separate clover DB. The attached patch does 
> not instrument the bw branch; it only covers the trunk tests.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1769) Fix wrong clover analysis because of backwards-tests, upgrade clover to 2.6.3 or better

2010-02-09 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1769:
--

Fix Version/s: 3.1

> Fix wrong clover analysis because of backwards-tests, upgrade clover to 2.6.3 
> or better
> ---
>
> Key: LUCENE-1769
> URL: https://issues.apache.org/jira/browse/LUCENE-1769
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: clover.license, clover.license, LUCENE-1769-2.patch, 
> LUCENE-1769.patch, LUCENE-1769.patch, LUCENE-1769.patch, LUCENE-1769.patch, 
> nicks-LUCENE-1769.patch
>
>
> This is a followup for 
> [http://www.lucidimagination.com/search/document/6248d6eafbe10ef4/build_failed_in_hudson_lucene_trunk_902]
> The problem with clover running on hudson is that it does not instrument all 
> tests that are run. The autodetection of clover 1.x is not able to find out 
> which files are the correct tests, and it only instruments the backwards 
> tests. Because of this, the current coverage report is only from the backwards 
> tests running against the current Lucene JAR.
> You can see this if you install clover and start the tests: during test-core 
> no clover data is added to the db; only when the backwards tests begin are new 
> files created in the clover db folder.
> Clover 2.x supports a new ant task element that can be used to specify the 
> files that are the tests. It works here locally with clover 2.4.3 and 
> produces a really nice coverage report; linking with test files also works, it 
> tells which tests failed, and so on.
> I will attach a patch that changes common-build.xml to the new clover 
> version (other initialization resource) and tells clover where to find the 
> tests (using the test folder include/exclude properties).
> One problem with the current patch: it does *not* instrument the backwards 
> branch, so you see only coverage of the core/contrib tests. Getting the 
> coverage from the backwards tests as well is not easily possible, because of 
> two things:
> - the tag test dir is not easy to find and add to the element 
> (there may be only one of them)
> - the test names in the BW branch are identical to the trunk tests. This 
> completely corrupts the linkage between tests and code in the coverage report.
> In principle the best approach would be to generate a second coverage report 
> for the backwards branch with a separate clover DB. The attached patch does 
> not instrument the bw branch; it only covers the trunk tests.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Build failed in Hudson: Lucene-trunk #1088

2010-02-09 Thread Apache Hudson Server
See 

Changes:

[mikemccand] add assert to catch mismatched delete count on write; add detail 
to exception messages on corruption

--
[...truncated 9352 lines...]

common.init:

build-lucene:

build-lucene-tests:

init:

test:
 [echo] Building wikipedia...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

init:

compile-test:
 [echo] Building wikipedia...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

init:

clover.setup:

clover.info:

clover:

compile-core:

common.compile-test:

common.test:
[mkdir] Created dir: 

[junit] Testsuite: 
org.apache.lucene.wikipedia.analysis.WikipediaTokenizerTest
[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 0.48 sec
[junit] 
   [delete] Deleting: 

 [echo] Building wordnet...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

init:

test:
 [echo] Building wordnet...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

init:

compile-test:
 [echo] Building wordnet...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

init:

clover.setup:

clover.info:

clover:

compile-core:

common.compile-test:

common.test:
[mkdir] Created dir: 

[junit] Testsuite: org.apache.lucene.wordnet.TestSynonymTokenFilter
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.543 sec
[junit] 
[junit] Testsuite: org.apache.lucene.wordnet.TestWordnet
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.365 sec
[junit] 
[junit] - Standard Output ---
[junit] Opening Prolog file 

[junit] [1/2] Parsing 

[junit] 2 s(10001,1,'woods',n,1,0). 0 0 ndecent=0
[junit] 4 s(10001,3,'forest',n,1,0). 2 1 ndecent=0
[junit] 8 s(10003,2,'baron',n,1,1). 6 3 ndecent=0
[junit] [2/2] Building index to store synonyms,  map sizes are 8 and 4
[junit] row=1/8 doc= Document 
stored,indexed>
[junit] row=2/8 doc= Document 
stored,omitNorms stored,indexed>
[junit] row=4/8 doc= Document 
stored,indexed>
[junit] Optimizing..
[junit] Opening Prolog file 

[junit] [1/2] Parsing 

[junit] 2 s(10001,1,'woods',n,1,0). 0 0 ndecent=0
[junit] 4 s(10001,3,'forest',n,1,0). 2 1 ndecent=0
[junit] 8 s(10003,2,'baron',n,1,1). 6 3 ndecent=0
[junit] [2/2] Building index to store synonyms,  map sizes are 8 and 4
[junit] row=1/8 doc= Document 
stored,indexed>
[junit] row=2/8 doc= Document 
stored,omitNorms stored,indexed>
[junit] row=4/8 doc= Document 
stored,indexed>
[junit] Optimizing..
[junit] -  ---
   [delete] Deleting: 

 [echo] Building xml-query-parser...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

init:

test:
 [echo] Building xml-query-parser...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

init:

compile-test:
 [echo] Building xml-query-parser...

build-queries:

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

init:

clover.setup:

clover.info:

clover:

common.compile-core:

compile-core:

common.compile-test:

common.test:
[mkdir] Created dir: 

[junit] Testsuite: org.apache.lucene.xmlparser.TestParser
[junit] Tests run: 18, Failures: 0, Error

RE: Build failed in Hudson: Lucene-trunk #1088

2010-02-09 Thread Uwe Schindler
The TestSpellChecker Executor problem seems to be a Sun bug fixed in JDK 
1.5.0_17 (an awaitTermination problem: 
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6576792 and related bugs). 
We updated lucene-zones's JVM for builds to the latest 1.5.0_22.

Thanks Mike!

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Apache Hudson Server [mailto:hud...@hudson.zones.apache.org]
> Sent: Tuesday, February 09, 2010 2:24 PM
> To: java-dev@lucene.apache.org
> Subject: Build failed in Hudson: Lucene-trunk #1088
> 
> See  trunk/1088/changes>
> 
> Changes:
> 
> [mikemccand] add assert to catch mismatched delete count on write; add
> detail to exception messages on corruption
> 
> --
> [...truncated 9352 lines...]
> 
> common.init:
> 
> build-lucene:
> 
> build-lucene-tests:
> 
> init:
> 
> test:
>  [echo] Building wikipedia...
> 
> javacc-uptodate-check:
> 
> javacc-notice:
> 
> jflex-uptodate-check:
> 
> jflex-notice:
> 
> common.init:
> 
> build-lucene:
> 
> build-lucene-tests:
> 
> init:
> 
> compile-test:
>  [echo] Building wikipedia...
> 
> javacc-uptodate-check:
> 
> javacc-notice:
> 
> jflex-uptodate-check:
> 
> jflex-notice:
> 
> common.init:
> 
> build-lucene:
> 
> build-lucene-tests:
> 
> init:
> 
> clover.setup:
> 
> clover.info:
> 
> clover:
> 
> compile-core:
> 
> common.compile-test:
> 
> common.test:
> [mkdir] Created dir:
>  trunk/ws/trunk/build/contrib/wikipedia/test>
> [junit] Testsuite:
> org.apache.lucene.wikipedia.analysis.WikipediaTokenizerTest
> [junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 0.48
> sec
> [junit]
>[delete] Deleting:
>  trunk/ws/trunk/build/contrib/wikipedia/test/junitfailed.flag>
>  [echo] Building wordnet...
> 
> javacc-uptodate-check:
> 
> javacc-notice:
> 
> jflex-uptodate-check:
> 
> jflex-notice:
> 
> common.init:
> 
> build-lucene:
> 
> build-lucene-tests:
> 
> init:
> 
> test:
>  [echo] Building wordnet...
> 
> javacc-uptodate-check:
> 
> javacc-notice:
> 
> jflex-uptodate-check:
> 
> jflex-notice:
> 
> common.init:
> 
> build-lucene:
> 
> build-lucene-tests:
> 
> init:
> 
> compile-test:
>  [echo] Building wordnet...
> 
> javacc-uptodate-check:
> 
> javacc-notice:
> 
> jflex-uptodate-check:
> 
> jflex-notice:
> 
> common.init:
> 
> build-lucene:
> 
> build-lucene-tests:
> 
> init:
> 
> clover.setup:
> 
> clover.info:
> 
> clover:
> 
> compile-core:
> 
> common.compile-test:
> 
> common.test:
> [mkdir] Created dir:
>  trunk/ws/trunk/build/contrib/wordnet/test>
> [junit] Testsuite: org.apache.lucene.wordnet.TestSynonymTokenFilter
> [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.543
> sec
> [junit]
> [junit] Testsuite: org.apache.lucene.wordnet.TestWordnet
> [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.365
> sec
> [junit]
> [junit] - Standard Output ---
> [junit] Opening Prolog file
>  trunk/ws/trunk/contrib/wordnet/src/test/org/apache/lucene/wordnet/testS
> ynonyms.txt>
> [junit] [1/2] Parsing
>  trunk/ws/trunk/contrib/wordnet/src/test/org/apache/lucene/wordnet/testS
> ynonyms.txt>
> [junit]   2 s(10001,1,'woods',n,1,0). 0 0 ndecent=0
> [junit]   4 s(10001,3,'forest',n,1,0). 2 1 ndecent=0
> [junit]   8 s(10003,2,'baron',n,1,1). 6 3 ndecent=0
> [junit] [2/2] Building index to store synonyms,  map sizes are 8
> and 4
> [junit]   row=1/8 doc= Document
> stored,indexed>
> [junit]   row=2/8 doc= Document
> stored,omitNorms stored,indexed>
> [junit]   row=4/8 doc= Document
> stored,indexed>
> [junit] Optimizing..
> [junit] Opening Prolog file
>  trunk/ws/trunk/contrib/wordnet/src/test/org/apache/lucene/wordnet/testS
> ynonyms.txt>
> [junit] [1/2] Parsing
>  trunk/ws/trunk/contrib/wordnet/src/test/org/apache/lucene/wordnet/testS
> ynonyms.txt>
> [junit]   2 s(10001,1,'woods',n,1,0). 0 0 ndecent=0
> [junit]   4 s(10001,3,'forest',n,1,0). 2 1 ndecent=0
> [junit]   8 s(10003,2,'baron',n,1,1). 6 3 ndecent=0
> [junit] [2/2] Building index to store synonyms,  map sizes are 8
> and 4
> [junit]   row=1/8 doc= Document
> stored,indexed>
> [junit]   row=2/8 doc= Document
> stored,omitNorms stored,indexed>
> [junit]   row=4/8 doc= Document
> stored,indexed>
> [junit] Optimizing..
> [junit] -  ---
>[delete] Deleting:
> 

[jira] Commented: (LUCENE-2154) Need a clean way for Dir/MultiReader to "merge" the AttributeSources of the sub-readers

2010-02-09 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831483#action_12831483
 ] 

Renaud Delbru commented on LUCENE-2154:
---

Sorry in advance -- maybe what I am saying is out of scope due to my partial 
understanding of the problem.

I have started to look at the problem, in order to be able to use my own 
attributes from my own DocsAndPositionsEnum classes.
Would it not be simpler to create a MultiAttributeSource that is instantiated 
in the MultiDocsAndPositionsEnum? At creation time, all the AttributeSources of 
the subreaders (which are available) would be passed to its constructor. This 
MultiAttributeSource would delegate the getAttribute call to the right 
DocsAndPositionsEnum's AttributeSource, as in the sketch below.

There is not a single AttributeSource shared by all the subreaders; each 
subreader keeps its own AttributeSource. In this way, attributes are not 
overridden. The MultiAttributeSource is in fact like a wrapper.

One problem is when there are custom attributes, e.g. BoostAttribute. If I 
understand correctly, if the user tries to access the BoostAttribute but one 
of the subreaders does not know it, an IllegalArgumentException will be thrown. 
Under the hood, the MultiAttributeSource can check if the attribute exists on 
the current subreader, and if not it can fall back on a default attribute, or a 
previously stored attribute (coming from a previous subreader).

I am not sure that what I am saying makes sense; it looks too simple to me to 
cover all the cases.  Are there cases I am not aware of? Could you give me some 
examples to make me aware of other problems?
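
Something like this very rough sketch (hypothetical, illustrative names only, 
not actual Lucene code):

{code}
import java.util.List;

// Hypothetical wrapper: it holds the per-subreader attribute sources and
// forwards getAttribute to whichever sub-enum is currently active.
final class MultiAttributeSourceSketch {

  interface AttributeSourceLike {
    <T> T getAttribute(Class<T> attClass);
  }

  private final List<AttributeSourceLike> subSources;  // one per sub-enum
  private int current;                                 // index of active sub-enum

  MultiAttributeSourceSketch(List<AttributeSourceLike> subSources) {
    this.subSources = subSources;
  }

  void setCurrent(int i) {
    current = i;
  }

  <T> T getAttribute(Class<T> attClass) {
    return subSources.get(current).getAttribute(attClass);
  }
}
{code}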

> Need a clean way for Dir/MultiReader to "merge" the AttributeSources of the 
> sub-readers
> ---
>
> Key: LUCENE-2154
> URL: https://issues.apache.org/jira/browse/LUCENE-2154
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: Flex Branch
>Reporter: Michael McCandless
> Fix For: Flex Branch
>
>
> The flex API allows extensibility at the Fields/Terms/Docs/PositionsEnum 
> levels, for a codec to set custom attrs.
> But, it's currently broken for Dir/MultiReader, which must somehow share 
> attrs across all the sub-readers.  Somehow we must make a single attr source, 
> and tell each sub-reader's enum to use that instead of creating its own.  
> Hopefully Uwe can work some magic here :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2154) Need a clean way for Dir/MultiReader to "merge" the AttributeSources of the sub-readers

2010-02-09 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831489#action_12831489
 ] 

Uwe Schindler commented on LUCENE-2154:
---

The problem is the following:

Attributes are not retrieved on every call to next(); they are added/retrieved 
once after construction. If you have a consumer of your MultiEnum, it calls 
attributes().getAttribute exactly one time before it starts to enumerate 
tokens/positions/whatever. If your proposed MultiAttributeSource returned 
the attribute of the first sub-enum, the consumer would stay with that 
attribute instance forever. If the MultiEnum then changes to another sub-enum, 
the consumer would not see the new attribute.

Because of that, the right way is not to have a MultiAttributeSource. What you 
need are proxy attributes: the attributes themselves must be proxies, delegating 
each call to the current enum's corresponding attribute. The same was done in 
Lucene 2.9 to emulate backwards compatibility for TokenStreams; the proxy 
was TokenWrapper. These proxy attributes would look exactly like that 
TokenWrapper impl class.
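
To make that concrete, a minimal sketch of the proxy idea (hypothetical names, 
not actual Lucene classes; TokenWrapper in 2.9 played the analogous role):

{code}
// The consumer obtains one ProxyBoostAttr up front and keeps it; the
// MultiEnum repoints the delegate whenever it advances to another sub-enum,
// so the consumer always sees the current sub-enum's state.
interface BoostAttr {
  float getBoost();
  void setBoost(float boost);
}

class SimpleBoostAttr implements BoostAttr {
  private float boost = 1.0f;
  public float getBoost() { return boost; }
  public void setBoost(float boost) { this.boost = boost; }
}

class ProxyBoostAttr implements BoostAttr {
  private BoostAttr delegate;                    // attr of the active sub-enum
  void switchTo(BoostAttr subEnumAttr) { delegate = subEnumAttr; }
  public float getBoost() { return delegate.getBoost(); }
  public void setBoost(float boost) { delegate.setBoost(boost); }
}
{code}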

> Need a clean way for Dir/MultiReader to "merge" the AttributeSources of the 
> sub-readers
> ---
>
> Key: LUCENE-2154
> URL: https://issues.apache.org/jira/browse/LUCENE-2154
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: Flex Branch
>Reporter: Michael McCandless
> Fix For: Flex Branch
>
>
> The flex API allows extensibility at the Fields/Terms/Docs/PositionsEnum 
> levels, for a codec to set custom attrs.
> But, it's currently broken for Dir/MultiReader, which must somehow share 
> attrs across all the sub-readers.  Somehow we must make a single attr source, 
> and tell each sub-reader's enum to use that instead of creating its own.  
> Hopefully Uwe can work some magic here :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Build failed in Hudson: Lucene-trunk #1088

2010-02-09 Thread Michael McCandless
On Tue, Feb 9, 2010 at 9:31 AM, Uwe Schindler wrote:
> The TestSpellChecker Executor problem seems to be a sun bug fixed in JDK 
> 1.5.0_17 (awaitTermination problem: 
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6576792 and related bugs). 
> We updated lucene-zones's JVM for builds to the latest 1.5.0_22.
>
> Thanks Mike!

You're welcome!  Let's hope it fixes this hang...

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2154) Need a clean way for Dir/MultiReader to "merge" the AttributeSources of the sub-readers

2010-02-09 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831521#action_12831521
 ] 

Renaud Delbru commented on LUCENE-2154:
---

I see. The problem is to return a unique attribute reference to the consumer 
when attributes().getAttribute is called, and then to update that reference while 
iterating the enums, in order to propagate the attribute changes to the consumer.

I am trying to propose a (possible) alternative solution (if I understood the 
problem correctly) which can avoid reflection, but could potentially need a 
modification of the Attribute interface.

The MultiAttributeSource would create its own set of unique references, one 
reference for each attribute class (the list of different attribute classes can 
be retrieved by calling the getAttributeClassesIterator() method of the 
AttributeSource of each subreader). The goal is then to update these references 
after each enum iteration or sub-enum change, in order to propagate the changes 
to the consumer.

Unfortunately, I don't see any method on the Attribute interface to 'copy' a 
given attribute. Each AttributeImpl could implement this 'copy' method, which 
copies the state of a given attribute of the same class.
Then, in the MultiDocsAndPositionsEnum, after each iteration or each sub-enum 
change, a call to the MultiAttributeSource can be made explicitly to update the 
unique references of the different attributes. This update method would, under 
the hood, (1) check if the sub-enum is aware of the attribute class, (2) get the 
attribute from the sub-enum, and (3) copy the attribute into the unique attribute 
reference kept by the MultiAttributeSource.
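
Roughly, in (hypothetical) code -- the names here are illustrative; the real 
Attribute/AttributeImpl API may differ:

{code}
interface Attr {
  void copyFrom(Attr other);               // the proposed 'copy' method
}

class IntAttr implements Attr {
  int value;
  public void copyFrom(Attr other) { this.value = ((IntAttr) other).value; }
}

class MultiAttrSourceSketch {
  private final IntAttr unique = new IntAttr();  // handed to the consumer once

  IntAttr getAttribute() { return unique; }

  // Called by the multi-enum after each iteration or sub-enum switch:
  void update(IntAttr subEnumAttr) {
    if (subEnumAttr != null) {           // (1) does the sub-enum know the class?
      unique.copyFrom(subEnumAttr);      // (2) get it and (3) copy its state
    }
  }
}
{code}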

Could this solution possibly work?

> Need a clean way for Dir/MultiReader to "merge" the AttributeSources of the 
> sub-readers
> ---
>
> Key: LUCENE-2154
> URL: https://issues.apache.org/jira/browse/LUCENE-2154
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: Flex Branch
>Reporter: Michael McCandless
> Fix For: Flex Branch
>
>
> The flex API allows extensibility at the Fields/Terms/Docs/PositionsEnum 
> levels, for a codec to set custom attrs.
> But, it's currently broken for Dir/MultiReader, which must somehow share 
> attrs across all the sub-readers.  Somehow we must make a single attr source, 
> and tell each sub-reader's enum to use that instead of creating its own.  
> Hopefully Uwe can work some magic here :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Hudson build is back to normal : Lucene-trunk #1089

2010-02-09 Thread Apache Hudson Server
See 



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2257) relax the per-segment max unique term limit

2010-02-09 Thread Michael McCandless (JIRA)
relax the per-segment max unique term limit
---

 Key: LUCENE-2257
 URL: https://issues.apache.org/jira/browse/LUCENE-2257
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9.2, 3.0.1, 3.1


Lucene can't handle more than 2.1B (the limit of a signed 32-bit int) unique 
terms in a single segment.

But I think we can improve this to termIndexInterval (default 128) * 2.1B.  
There is one place (internal API only) where Lucene uses an int but should use 
a long.
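
Back-of-the-envelope for the proposed ceiling (my own illustration, assuming 
the one remaining 32-bit count is over indexed terms only):

{code}
public class TermLimitMath {
  public static void main(String[] args) {
    // Only every termIndexInterval-th term is recorded in the term index, so
    // if the 32-bit count applies to *indexed* terms, the ceiling on total
    // unique terms per segment becomes interval * Integer.MAX_VALUE.
    int termIndexInterval = 128;                    // Lucene's default
    long oldLimit = Integer.MAX_VALUE;              // ~2.1 billion
    long newLimit = (long) termIndexInterval * Integer.MAX_VALUE;  // ~275 billion
    System.out.println(oldLimit + " -> " + newLimit);
  }
}
{code}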

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2257) relax the per-segment max unique term limit

2010-02-09 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2257:
---

Attachment: LUCENE-2257.patch

Possible patch fixing the issue.  I'm not yet certain there is no other place 
where we use an int...

> relax the per-segment max unique term limit
> ---
>
> Key: LUCENE-2257
> URL: https://issues.apache.org/jira/browse/LUCENE-2257
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9.2, 3.0.1, 3.1
>
> Attachments: LUCENE-2257.patch
>
>
> Lucene can't handle more than 2.1B (the limit of a signed 32-bit int) unique 
> terms in a single segment.
> But I think we can improve this to termIndexInterval (default 128) * 2.1B.  
> There is one place (internal API only) where Lucene uses an int but should 
> use a long.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2154) Need a clean way for Dir/MultiReader to "merge" the AttributeSources of the sub-readers

2010-02-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831646#action_12831646
 ] 

Michael McCandless commented on LUCENE-2154:


What if we require that all segments use the same codec if you want to use 
attributes from a Multi*Enum?  (I think this limitation is fine... and if it's 
not, one could still operate per-segment, with different attr impls per segment.)

This way, every segment would share the same attr impl for a given attr 
interface?

And then couldn't we somehow force each segment to use the same attr impl as 
the last segment(s)?

> Need a clean way for Dir/MultiReader to "merge" the AttributeSources of the 
> sub-readers
> ---
>
> Key: LUCENE-2154
> URL: https://issues.apache.org/jira/browse/LUCENE-2154
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: Flex Branch
>Reporter: Michael McCandless
> Fix For: Flex Branch
>
>
> The flex API allows extensibility at the Fields/Terms/Docs/PositionsEnum 
> levels, for a codec to set custom attrs.
> But, it's currently broken for Dir/MultiReader, which must somehow share 
> attrs across all the sub-readers.  Somehow we must make a single attr source, 
> and tell each sub-reader's enum to use that instead of creating its own.  
> Hopefully Uwe can work some magic here :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Release Lucene Java 2.9.2 & 3.0.(1|2) together soon

2010-02-09 Thread Grant Ingersoll

On Feb 7, 2010, at 8:45 AM, Michael McCandless wrote:

> +1 to release.  Thank you for volunteering :)  We've got a number of
> good bug fixes pending...
> 
> But: I think we should simply name it 3.0.1?  If we skip 3.0.1 I think

I'd agree.  Stick w/ 3.0.1

-Grant

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.

2010-02-09 Thread Fuad Efendi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829163#action_12829163
 ] 

Fuad Efendi edited comment on LUCENE-2230 at 2/9/10 9:17 PM:
-

After long-run load-stress tests...

I used 2 boxes, one with SOLR, the other with a simple multithreaded stress 
simulator (with randomly generated fuzzy query samples); each box is 2x AMD 
Opteron 2350 (8 cores per box), 64-bit.

I disabled all SOLR caches except the Document Cache (I want isolated tests; I 
want to ignore the time taken by disk I/O to load documents).

Performance scaled with the number of load-stress threads (on the "client" 
computer), then dropped:

9 Threads:
==
TPS: 200 - 210
Response: 45 - 50 (ms)

10 Threads:
===
TPS: 200 - 215
Response: 45 - 55 (ms)

12 Threads:
===
TPS: 180 - 220
Response: 50 - 90 (ms)
 
16 Threads:
===
TPS: 60 - 65
Response: 230 - 260 (ms)
 

This can be explained by CPU-bound processing with 8 cores available; the "top" 
command on the SOLR instance showed 750% - 790% CPU time (8-core) on the 3rd 
step (12 stressing threads), and 200% on the 4th step (16 stressing threads) - 
probably due to network I/O, Tomcat internals, etc.

It's better to have Apache HTTPD in front of SOLR in production, with proxy_ajp 
(persistent connections) and HTTP caching enabled, and to fine-tune Tomcat 
threads according to the use case.

BTW, my best counters for default SOLR/Lucene were:
TPS: 12
Response: 750ms

"Fuzzy" queries were tuned such a way that distance threshold was less than or 
equal two. I used "StrikeAMatch" distance...

Thanks,
http://www.tokenizer.ca
+1 416-993-2060(cell)

P.S.
Before performing load-stress tests, I established the baseline in my 
environment: 1500 TPS by pinging http://x.x.x.x:8080/apache-solr-1.4/admin/ 
(static JSP).
And, I reached 220TPS for fuzzy search, starting from 12-15TPS (default 
Lucene/SOLR)...



  was (Author: funtick):
After long-run load-stress tests...

I used 2 boxes, one with SOLR, another one with simple multithreaded stress 
simulator (with randomply generated fuzzy query samples); each box is 2x AMD 
Opteron 2350 (8 core per box); 64-bit.

I disabled all SOLR caches except Document Cache (I want isolated tests; I want 
to ignore time taken by disk I/O to load document).

Performance boosted accordingly to number of load-stress threads (on "client" 
computer), then dropped: 

9 Threads:
==
TPS: 200 - 210
Response: 45 - 50 (ms)

10 Threads:
===
TPS: 200 - 215
Response: 45 - 55 (ms)

12 Threads:
===
TPS: 180 - 220
Response: 50 - 90 (ms)
 
16 Threads:
===
TPS: 60 - 65
Response: 230 - 260 (ms)
 

It can be explained by CPU-bound processing and 8 cores available; "top" 
command on SOLR instance was shown 750% - 790% CPU time (8-core) on 3rd step 
(12 stressing threads), and 200% on 4th step (16 stressing threads) - due 
probably to Network I/O, Tomcat internals, etc.

It's better to have Apache HTTPD in front of SOLR in production, with proxy_ajp 
(persistent connections) and HTTP caching enabled; and fine-tune Tomcat threads 
according to use case.

BTW, my best counters for default SOLR/Lucene were:
TPS: 12
Response: 750ms

"Fuzzy" queries were tuned such a way that distance threshold was less than or 
equal two. I used "StrikeAMatch" distance...

Thanks,
http://www.tokenizer.ca
+1 416-993-2060(cell)
  
> Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
> 
>
> Key: LUCENE-2230
> URL: https://issues.apache.org/jira/browse/LUCENE-2230
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 3.0
> Environment: Lucene currently uses a brute force full-terms scanner and 
> calculates the distance for each term. The new BKTree structure improves 
> performance on average 20 times when the distance is 1, and 3 times when the 
> distance is 3. I tested with an index of several million docs, and 250,000 terms. 
> The new algo uses integer distances between objects.
>Reporter: Fuad Efendi
> Attachments: BKTree.java, Distance.java, DistanceImpl.java, 
> FuzzyTermEnumNEW.java, FuzzyTermEnumNEW.java
>
>   Original Estimate: 0.02h
>  Remaining Estimate: 0.02h
>
> W. Burkhard and R. Keller. Some approaches to best-match file searching, 
> CACM, 1973
> http://portal.acm.org/citation.cfm?doid=362003.362025
> I was inspired by 
> http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick 
> Johnson, Google).
> Additionally, the simplified algorithm at 
> http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more 
> logically correct than Levenshtein distance, and it is 3-5 times faster 
> (isolated tests).
> Big list of distance implementations:
> http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm

-- 
This message is automatically 

[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.

2010-02-09 Thread Fuad Efendi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829163#action_12829163
 ] 

Fuad Efendi edited comment on LUCENE-2230 at 2/9/10 9:35 PM:
-

After long-run load-stress tests...

I used 2 boxes, one with SOLR, the other with a simple multithreaded stress 
simulator (with randomly generated fuzzy query samples); each box is 2x AMD 
Opteron 2350 (8 cores per box), 64-bit.

I disabled all SOLR caches except the Document Cache (I want isolated tests; I 
want to ignore the time taken by disk I/O to load documents).

Performance scaled with the number of load-stress threads (on the "client" 
computer), then dropped:

9 Threads:
==
TPS: 200 - 210
Response: 45 - 50 (ms)

10 Threads:
===
TPS: 200 - 215
Response: 45 - 55 (ms)

12 Threads:
===
TPS: 180 - 220
Response: 50 - 90 (ms)
 
16 Threads:
===
TPS: 60 - 65
Response: 230 - 260 (ms)
 

This can be explained by CPU-bound processing with 8 cores available; the "top" 
command on the SOLR instance showed 750% - 790% CPU time (8-core) on the 3rd 
step (12 stressing threads), and 200% on the 4th step (16 stressing threads) - 
probably due to network I/O, Tomcat internals, etc.

It's better to have Apache HTTPD in front of SOLR in production, with proxy_ajp 
(persistent connections) and HTTP caching enabled, and to fine-tune Tomcat 
threads according to the use case.

BTW, my best counters for default SOLR/Lucene were:
TPS: 12
Response: 750ms

"Fuzzy" queries were tuned such a way that distance threshold was less than or 
equal two. I used "StrikeAMatch" distance...

Thanks,
http://www.tokenizer.ca
+1 416-993-2060(cell)

P.S.
Before performing load-stress tests, I established the baseline in my 
environment: 1500 TPS by pinging http://x.x.x.x:8080/apache-solr-1.4/admin/ 
(static JSP).
And, I reached 220TPS for fuzzy search, starting from 12-15TPS (default 
Lucene/SOLR)...

P.P.S.
The distance function must satisfy 3 'axioms':
{code}
D(a,a) = 0
D(a,b) = D(b,a)
D(a,b) + D(b,c) >= D(a,c)
{code}

And the function must return an integer value.

Otherwise, BKTree will produce wrong results. 
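
To illustrate why the axioms matter, here is a minimal BK-tree sketch (my own 
illustration, not the attached BKTree.java); the pruning step is only correct 
if the triangle inequality holds:

{code}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BKNodeSketch {

  interface Metric {
    int distance(String a, String b);    // must satisfy the 3 axioms above
  }

  final String term;
  final Map<Integer, BKNodeSketch> children = new HashMap<Integer, BKNodeSketch>();

  BKNodeSketch(String term) { this.term = term; }

  void insert(String t, Metric m) {
    int d = m.distance(term, t);
    BKNodeSketch child = children.get(d);
    if (child == null) children.put(d, new BKNodeSketch(t));
    else child.insert(t, m);
  }

  // By the triangle inequality, a child on edge k can contain a match only
  // if |d(query, this) - k| <= threshold; all other subtrees are skipped.
  void search(String q, int threshold, Metric m, List<String> out) {
    int d = m.distance(term, q);
    if (d <= threshold) out.add(term);
    for (int k = Math.max(0, d - threshold); k <= d + threshold; k++) {
      BKNodeSketch child = children.get(k);
      if (child != null) child.search(q, threshold, m, out);
    }
  }
}
{code}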


Also, it's mentioned somewhere in the Levenshtein algo Javadocs (in the contrib 
folder, I believe) that an instance method runs faster than a static method; 
this needs testing with Java 6... most probably 'yes', it depends on the JVM 
implementation; I can only guess that CPU internals are better optimized for 
instance methods...

  was (Author: funtick):
After long-run load-stress tests...

I used 2 boxes, one with SOLR, another one with simple multithreaded stress 
simulator (with randomply generated fuzzy query samples); each box is 2x AMD 
Opteron 2350 (8 core per box); 64-bit.

I disabled all SOLR caches except Document Cache (I want isolated tests; I want 
to ignore time taken by disk I/O to load document).

Performance boosted accordingly to number of load-stress threads (on "client" 
computer), then dropped: 

9 Threads:
==
TPS: 200 - 210
Response: 45 - 50 (ms)

10 Threads:
===
TPS: 200 - 215
Response: 45 - 55 (ms)

12 Threads:
===
TPS: 180 - 220
Response: 50 - 90 (ms)
 
16 Threads:
===
TPS: 60 - 65
Response: 230 - 260 (ms)
 

It can be explained by CPU-bound processing and 8 cores available; "top" 
command on SOLR instance was shown 750% - 790% CPU time (8-core) on 3rd step 
(12 stressing threads), and 200% on 4th step (16 stressing threads) - due 
probably to Network I/O, Tomcat internals, etc.

It's better to have Apache HTTPD in front of SOLR in production, with proxy_ajp 
(persistent connections) and HTTP caching enabled; and fine-tune Tomcat threads 
according to use case.

BTW, my best counters for default SOLR/Lucene were:
TPS: 12
Response: 750ms

"Fuzzy" queries were tuned such a way that distance threshold was less than or 
equal two. I used "StrikeAMatch" distance...

Thanks,
http://www.tokenizer.ca
+1 416-993-2060(cell)

P.S.
Before performing load-stress tests, I established the baseline in my 
environment: 1500 TPS by pinging http://x.x.x.x:8080/apache-solr-1.4/admin/ 
(static JSP).
And, I reached 220TPS for fuzzy search, starting from 12-15TPS (default 
Lucene/SOLR)...


  
> Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
> 
>
> Key: LUCENE-2230
> URL: https://issues.apache.org/jira/browse/LUCENE-2230
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 3.0
> Environment: Lucene currently uses a brute force full-terms scanner and 
> calculates the distance for each term. The new BKTree structure improves 
> performance on average 20 times when the distance is 1, and 3 times when the 
> distance is 3. I tested with an index of several million docs, and 250,000 terms. 
> The new algo uses integer distances between objects.
>Reporter: Fuad Efendi
>

[jira] Updated: (LUCENE-2111) Wrapup flexible indexing

2010-02-09 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2111:


Attachment: LUCENE-2111_fuzzy.patch

Mike, here is a patch for removal of fuzzy nocommits:
* remove synchronization (not necessary, history here: LUCENE-296)
* reuse char[] rather than create Strings
* remove unused ctors
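
The char[] reuse looks roughly like this (a hypothetical sketch, not the 
actual patch):

{code}
// Grow-once scratch buffer: term text is copied into a reused char[] instead
// of allocating a fresh String for every term the fuzzy enum examines.
final class TermScratch {
  private char[] buf = new char[16];
  private int len;

  char[] fill(CharSequence term) {
    len = term.length();
    if (buf.length < len) {
      buf = new char[Math.max(len, buf.length * 2)];
    }
    for (int i = 0; i < len; i++) {
      buf[i] = term.charAt(i);
    }
    return buf;
  }

  int length() { return len; }
}
{code}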

> Wrapup flexible indexing
> 
>
> Key: LUCENE-2111
> URL: https://issues.apache.org/jira/browse/LUCENE-2111
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: Flex Branch
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, 
> LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, 
> LUCENE-2111_fuzzy.patch
>
>
> Spinoff from LUCENE-1458.
> The flex branch is in fairly good shape -- all tests pass, initial search 
> performance testing looks good, it survived several visits from the Unicode 
> policeman ;)
> But it still has a number of nocommits, could use some more scrutiny 
> especially on the "emulate old API on flex index" and vice/versa code paths, 
> and still needs some more performance testing.  I'll do these under this 
> issue, and we should open separate issues for other self contained fixes.
> The end is in sight!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2111) Wrapup flexible indexing

2010-02-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831865#action_12831865
 ] 

Robert Muir commented on LUCENE-2111:
-

btw, I benched that patch with my contrived benchmark for LUCENE-2089; weird 
that flex was slower than trunk before.
Numbers are stable across many iterations.
||unpatched flex||patched flex||trunk||
|4362ms|3239ms|3459ms|

> Wrapup flexible indexing
> 
>
> Key: LUCENE-2111
> URL: https://issues.apache.org/jira/browse/LUCENE-2111
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: Flex Branch
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, 
> LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, 
> LUCENE-2111_fuzzy.patch
>
>
> Spinoff from LUCENE-1458.
> The flex branch is in fairly good shape -- all tests pass, initial search 
> performance testing looks good, it survived several visits from the Unicode 
> policeman ;)
> But it still has a number of nocommits, could use some more scrutiny 
> especially on the "emulate old API on flex index" and vice/versa code paths, 
> and still needs some more performance testing.  I'll do these under this 
> issue, and we should open separate issues for other self contained fixes.
> The end is in sight!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org