[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703034#action_12703034
 ] 

Michael McCandless commented on LUCENE-1616:


Should we deprecate the separate setters with this addition?

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch


 add OffsetAttribute.setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it
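
The proposed change can be sketched as follows (a hypothetical, simplified stand-in class, not the actual Lucene source; names are assumed from the patch description):

```java
// Hypothetical, simplified stand-in for OffsetAttribute, illustrating the
// combined setter proposed in LUCENE-1616 next to the separate setters.
public class OffsetAttributeSketch {
    private int startOffset;
    private int endOffset;

    // existing separate setters
    public void setStartOffset(int offset) { this.startOffset = offset; }
    public void setEndOffset(int offset) { this.endOffset = offset; }

    // proposed combined setter: both offsets updated in one call
    public void setOffset(int startOffset, int endOffset) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    public int startOffset() { return startOffset; }
    public int endOffset() { return endOffset; }
}
```

A tokenizer such as CharTokenizer would then make one setOffset(start, end) call per token instead of two separate setter calls.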

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703038#action_12703038
 ] 

Uwe Schindler commented on LUCENE-1616:
---

Not really; the attributes API was added for 2.9 and has not yet appeared in an 
official release, so the separate setters could simply be removed.




Re: CHANGES.txt

2009-04-27 Thread Michael McCandless
OK I fixed CHANGES.txt to not have double entries for the same issue
in 2.4.1 and trunk (ie, the entry is only in 2.4.1's CHANGES section).

And going forward, if a trunk issue gets backported to a point
release, we should de-dup the entries on releasing the point release.
Ie, before the point release is released, trunk can contain XXX as
well as the branch for the point release, but on copying back the
branch's CHANGES entries, we de-dup them.  I'll update ReleaseTodo in
the wiki.

Thanks Steven!

Mike

On Sat, Apr 25, 2009 at 5:43 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 On Fri, Apr 24, 2009 at 7:17 PM, Steven A Rowe sar...@syr.edu wrote:

 Maybe even tiny bug fixes should always be called out on trunk's
 CHANGES.  Or, maybe a tiny bug fix that also gets backported to a
 point release, must then be called out in both places?  I think I
 prefer the 2nd.

 The difference between these two options is that in the 2nd, tiny bug fixes 
 are mentioned in trunk's CHANGES only if they are backported to a point 
 release, right?

 For the record, the previous policy (the zeroth option :) appears to be that 
 backported bug fixes, regardless of size, are mentioned only once, in the 
 CHANGES for the (chronologically) first release in which they appeared.  You 
 appear to oppose this policy, because (paraphrasing): people would wonder 
 whether point release fixes were also fixed on following major/minor 
 releases.  IMNSHO, however, people (sometimes erroneously) view product 
 releases as genetically linear: naming a release A.(B)[.x] implies 
 inclusion of all changes to any release A.B[.y].  I.e., my sense is quite 
 the opposite of yours: I would be *shocked* if bug fixes included in version 
 2.4.1 were not included (or explicitly called out as not included) in 
 version 2.9.0.

 If more than one point release branch is active at any one time, then things 
 get more complicated (genetic linearity can no longer be assumed), and your 
 new policy seems like a reasonable attempt at managing the complexity.  But 
 will Lucene ever have more than one active bugfix branch?  It never has 
 before.

 But maybe I'm not understanding your intent: are you distinguishing between 
 released CHANGES and unreleased CHANGES?  That is, do you intend to apply 
 this new policy only to the unreleased trunk CHANGES, but then remove the 
 redundant bugfix notices once a release is performed?

 OK you've convinced me (to go back to the 0th policy)!  Users can and
 should assume on seeing a point release containing XXX that all future
 releases also include XXX.  Ie, CHANGES should not be a vehicle for
 confirming that this is what happened.

 So if XXX is committed to trunk and gets a CHANGES entry, and at a later
 time it's backported to a point release, I will remove XXX from
 the trunk CHANGES and put it *only* in the point release's CHANGES.

 Also, I'll go and fix CHANGES, to remove the trunk entries when
 there's a point-release entry, if nobody objects in the next day or
 so.

 Mike





[jira] Resolved: (LUCENE-1615) deprecated method used in fieldsReader / setOmitTf()

2009-04-27 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1615.


Resolution: Fixed

OK I just committed this -- thanks Eks!

 deprecated method used in fieldsReader / setOmitTf()
 

 Key: LUCENE-1615
 URL: https://issues.apache.org/jira/browse/LUCENE-1615
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Eks Dev
Priority: Trivial
 Attachments: LUCENE-1615.patch


 setOmitTf(boolean) is deprecated and should not be used by core classes. One 
 place where it appears is FieldsReader; this patch fixes it. It was 
 necessary to change Fieldable to AbstractField in two places, only local 
 variables.




[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703041#action_12703041
 ] 

Michael McCandless commented on LUCENE-1616:


Oh yeah :)  Good!  I'm losing track of what's not yet released...

Eks, can you update the patch with that?  Thanks.




[jira] Updated: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1616:
---

Fix Version/s: 2.9




[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703054#action_12703054
 ] 

Earwin Burrfoot commented on LUCENE-1616:
-

Separate setters might have their own use? I believe I had a pair of filters 
that set begin and end offset in different parts of the code.




[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703062#action_12703062
 ] 

Michael McCandless commented on LUCENE-1616:


But surely that's a very rare case (the exception, not the rule).  Ie nearly 
always, one sets start & end offset together?




Re: Adding testpackage to common-build.xml

2009-04-27 Thread Michael McCandless
This sounds like a great change!  It would also allow us to test the
other (index, store, etc.) packages too?

I don't think this is possible today, though I'm no expert with ant so
it's entirely possible I've missed it.

Presumably once we modularize, then the module would be the
natural unit for testing (but it seems like this will be a ways
off...).

Mike

On Mon, Apr 27, 2009 at 7:01 AM, Shai Erera ser...@gmail.com wrote:
 Hi

 I noticed that one can define testcase to execute just one test class,
 which is convenient. However, I didn't notice any equivalent for testing a
 whole package. Lately, when handling all those search issues, I often wanted
 to run all the tests in o.a.l.search and just them, but couldn't so I either
 ran a single test class when it was obvious (like TestSort) or the
 test-core when it was less obvious (like changes to Collector, or
 BooleanScorer).

 I wrote a simple patch which adds this capability to common-build.xml. I
 would like to confirm first that you agree to add this change and that I
 didn't miss it and this capability exists elsewhere already.

 Shai





Re: Adding testpackage to common-build.xml

2009-04-27 Thread Shai Erera
OK, then I'll open an issue and post the patch. You can review it and give it
a try.

Shai

On Mon, Apr 27, 2009 at 2:10 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 This sounds like a great change!  It would also allow us to test the
 other (index, store, etc.) packages too?

 I don't think this is possible today, though I'm no expert with ant so
 it's entirely possible I've missed it.

 Presumably once we modularize, then the module would be the
 natural unit for testing (but it seems like this will be a ways
 off...).

 Mike

 On Mon, Apr 27, 2009 at 7:01 AM, Shai Erera ser...@gmail.com wrote:
  Hi
 
  I noticed that one can define testcase to execute just one test class,
  which is convenient. However, I didn't notice any equivalent for testing
 a
  whole package. Lately, when handling all those search issues, I often
 wanted
  to run all the tests in o.a.l.search and just them, but couldn't so I
 either
  ran a single test class when it was obvious (like TestSort) or the
  test-core when it was less obvious (like changes to Collector, or
  BooleanScorer).
 
  I wrote a simple patch which adds this capability to common-build.xml. I
  would like to confirm first that you agree to add this change and that I
  didn't miss it and this capability exists elsewhere already.
 
  Shai
 





[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703067#action_12703067
 ] 

Earwin Burrfoot commented on LUCENE-1616:
-

I have two cases.
In one case I can't access the start offset by the time I set the end offset, and 
therefore have to introduce a field on the filter to keep track of it (or 
use the next case's solution twice), if the separate setters are removed.
In the other case I only need to adjust the end offset, so I'll have to do 
attr.setOffset(attr.getStartOffset(), newEndOffset).
Nothing deadly, but I don't see the point of removing methods that might be 
useful and don't interfere with anything.
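
The second case can be sketched like this (hypothetical class and method names, an illustration of the workaround rather than Lucene code): with only the combined setter, adjusting the end offset means reading the start offset back first.

```java
// Hypothetical illustration of adjusting only the end offset when just a
// combined setOffset(start, end) is available.
public class OffsetAdjustSketch {
    private int startOffset;
    private int endOffset;

    public void setOffset(int startOffset, int endOffset) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    public int startOffset() { return startOffset; }
    public int endOffset() { return endOffset; }

    // A filter that only wants to move the end offset must echo the start back:
    public void extendEnd(int newEndOffset) {
        setOffset(startOffset(), newEndOffset);
    }
}
```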




Adding testpackage to common-build.xml

2009-04-27 Thread Shai Erera
Hi

I noticed that one can define testcase to execute just one test class,
which is convenient. However, I didn't notice any equivalent for testing a
whole package. Lately, when handling all those search issues, I often wanted
to run all the tests in o.a.l.search and just them, but couldn't so I either
ran a single test class when it was obvious (like TestSort) or the
test-core when it was less obvious (like changes to Collector, or
BooleanScorer).

I wrote a simple patch which adds this capability to common-build.xml. I
would like to confirm first that you agree to add this change and that I
didn't miss it and this capability exists elsewhere already.

Shai


[jira] Updated: (LUCENE-1617) Add testpackage to common-build.xml

2009-04-27 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1617:
---

Attachment: LUCENE-1617.patch

 Add testpackage to common-build.xml
 -

 Key: LUCENE-1617
 URL: https://issues.apache.org/jira/browse/LUCENE-1617
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Build
Reporter: Shai Erera
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1617.patch


 One can define testcase to execute just one test class, which is 
 convenient. However, I didn't notice any equivalent for testing a whole 
 package. I find it convenient to be able to test packages rather than test 
 cases because often it is not so clear which test class to run.
 Following patch allows one to ant test -Dtestpackage=search (for example) 
 and run all tests under the \*/search/\* packages in core, contrib and tags, 
 or do ant test-core -Dtestpackage=search and execute similarly just for 
 core, or do ant test-core -Dtestpackage=lucene/search/function and run all 
 the tests under \*/lucene/search/function/\* (just in case there is another 
 o.a.l.something.search.function package out there which we want to exclude).




[jira] Created: (LUCENE-1617) Add testpackage to common-build.xml

2009-04-27 Thread Shai Erera (JIRA)
Add testpackage to common-build.xml
-

 Key: LUCENE-1617
 URL: https://issues.apache.org/jira/browse/LUCENE-1617
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Build
Reporter: Shai Erera
Priority: Minor
 Fix For: 2.9
 Attachments: LUCENE-1617.patch





[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703085#action_12703085
 ] 

Eks Dev commented on LUCENE-1616:
-

I am ok with both options; removing the separate setters looks a bit better to 
me, as it forces users to think atomically about offset = {start, end}. 

If you set start and end offset too far apart in your code, the probability that 
you miss a mistake somewhere is higher compared to the case where you manage 
start and end on your own (it is then rather explicit in your code)... 

But that is all really something we should not think too much about :) We 
make no mistakes either way.
 
I can provide a new patch, if needed. 




[jira] Assigned: (LUCENE-1604) Stop creating huge arrays to represent the absence of field norms

2009-04-27 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1604:
--

Assignee: Michael McCandless

 Stop creating huge arrays to represent the absence of field norms
 -

 Key: LUCENE-1604
 URL: https://issues.apache.org/jira/browse/LUCENE-1604
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.9
Reporter: Shon Vella
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1604.patch, LUCENE-1604.patch, LUCENE-1604.patch


 Creating and keeping around huge arrays that hold a constant value is very 
 inefficient both from a heap usage standpoint and from a locality of 
 reference standpoint. It would be much more efficient to use null to 
 represent a missing norms table.




[jira] Commented: (LUCENE-1604) Stop creating huge arrays to represent the absence of field norms

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703101#action_12703101
 ] 

Michael McCandless commented on LUCENE-1604:



I tested this change on a Wikipedia index, with query 1, on a field
that has norms.

On Linux, JDK 1.6.0_13, I can see no performance difference (both get
7.2 qps, best of 10 runs).

On Mac OS X 10.5.6, I see some difference (13.0 vs 12.3, best of 10
runs), but given the quirkiness I've seen in OS X results not matching
other platforms, I think we can disregard this.

Also, given the performance gain one sees when norms are disabled, I
think this is net/net a good change.

We'll leave the default as false (for back compat), but this setting
is deprecated with a comment that in 3.0 it hardwires to true.





[jira] Commented: (LUCENE-1604) Stop creating huge arrays to represent the absence of field norms

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703103#action_12703103
 ] 

Michael McCandless commented on LUCENE-1604:


New patch attached:

  * Fixed contrib/instantiated & contrib/misc to pass if I change
default for disableFakeNorms to true (which we will hardwire in
3.0)

  * Tweaked javadocs

  * Removed unused imports

  * Added CHANGES.txt entry

I still need to review the rest of the patch...

With this patch, all tests pass with the default set to false
(back-compat).  If I temporarily set it to true, all tests now pass,
except back-compat (which is expected & fine).

I had started down the path of having contrib/instantiated respect
the disableFakeNorms setting, but rapidly came to realize how little I
understand contrib/instantiated's code ;) So I fell back to fixing the
unit tests to accept null returns from the normal
IndexReader.norms(...).
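
The shape of the change, as a rough sketch (hypothetical names and default-norm value; the real SegmentReader/IndexReader code differs): callers treat a null norms array as "every document has the default norm", instead of receiving a maxDoc-sized array filled with that constant.

```java
// Hypothetical sketch of the LUCENE-1604 idea: return null instead of a
// constant-filled "fake norms" array when a field has no norms.
import java.util.Arrays;

public class NormsSketch {
    // Assumed placeholder for the default norm byte; the real value is
    // whatever Similarity encodes for a boost of 1.0.
    static final byte DEFAULT_NORM = 0x7C;

    // Before: one byte per document, all identical -- wasteful on big indexes.
    static byte[] fakeNorms(int maxDoc) {
        byte[] norms = new byte[maxDoc];
        Arrays.fill(norms, DEFAULT_NORM);
        return norms;
    }

    // After: null signals "no norms stored for this field"; nothing allocated.
    static byte[] norms(boolean fieldHasNorms, int maxDoc) {
        return fieldHasNorms ? new byte[maxDoc] : null;
    }

    // Callers look norms up defensively, falling back to the default.
    static byte norm(byte[] norms, int doc) {
        return norms == null ? DEFAULT_NORM : norms[doc];
    }
}
```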





[jira] Updated: (LUCENE-1604) Stop creating huge arrays to represent the absence of field norms

2009-04-27 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1604:
---

Attachment: LUCENE-1604.patch

Attached patch.

I also added assert !getDisableFakeNorms(); inside SegmentReader.fakeNorms().





[jira] Updated: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-27 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1593:
---

Attachment: PerfTest.java
LUCENE-1593.patch

The patch implements all that has been suggested except:
* pre-populating the queue in TopFieldCollector - as was noted here previously, 
this seems to remove the 'if (queueFull)' check but add another 'if' in 
FieldComparator (which may be executed several times per collect()).
* Move initCountingSumScorer() to BS2's ctor and add(). That's because if more 
than one Scorer is added we create a DisjunctionSumScorer, which initializes 
its queue by calling next() on the passed-in Scorer. Therefore if we call 
initCountingSumScorer for every Scorer added, we advance that Scorer as well as 
all the previous ones. I chose to discard that optimization, which only affects 
next() and skipTo().

The patch also includes the fix for TestSort in the 2.4 back_compat branch. I 
only fixed TestSort, and not MultiSearcher and ParallelMultiSearcher.

All tests pass.

I also ran some performance measurements (all on SRV 2003):

|| JRE || sort || best time (trunk) || best time (patch) || diff (%) ||
| SUN 1.6 | int | 1017.59 | 1015.96 | {color:green}~1%{color} |
| SUN 1.6 | doc | 767.49 | 763.20 | {color:green}~1%{color} |
| IBM 1.5 | int | 1018.77 | 1017.39 | {color:green}~1%{color} |
| IBM 1.5 | doc | 768.10 | 764.14 | {color:green}~1%{color} |

As you can see, there is a slight performance improvement, but nothing too 
dramatic.

You are welcome to review the patch as well as run the PerfTest I attached. It 
accepts two arguments: indexDir and [sort]. 'sort' is optional and if not 
defined it sorts by doc. Otherwise, whatever you pass there, it sorts by int.

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: LUCENE-1593.patch, PerfTest.java


 This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
 to remove unnecessary checks. The plan is:
 # Ensure that IndexSearcher returns segments in increasing doc Id order, 
 instead of numDocs().
 # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
 will always have larger ids and therefore cannot compete.
 # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
 and remove the check if reusableSD == null.
 # Also move to use changing top and then call adjustTop(), in case we 
 update the queue.
 # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker 
 for the last SortField. But, doing so should not be necessary (since we 
 already break ties by docID), and is in fact less efficient (once the above 
 optimization is in).
 # Investigate PQ - can we deprecate insert() and have only 
 insertWithOverflow()? Add an addDummyObjects method which will populate the 
 queue without arranging it, just store the objects in the array (this can 
 be used to pre-populate sentinel values)?
 I will post a patch as well as some perf measurements as soon as I have them.
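
Item 3 above can be sketched like this (a deliberately simplified, hypothetical array-based stand-in; the real HitQueue is a heap, with the sentinels' scores set to negative infinity as described):

```java
// Simplified, hypothetical sketch of pre-populating a top-N collector's
// queue with -Infinity sentinels so collect() needs no null/size checks:
// any real hit outscores a sentinel, so insert-with-overflow "just works".
import java.util.Arrays;

public class SentinelQueueSketch {
    final float[] scores;

    SentinelQueueSketch(int size) {
        scores = new float[size];
        Arrays.fill(scores, Float.NEGATIVE_INFINITY); // sentinel entries
    }

    // Linear-scan insert-with-overflow (the real queue is a heap).
    boolean offer(float score) {
        int min = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] < scores[min]) min = i;
        }
        if (score <= scores[min]) return false; // cannot compete
        scores[min] = score;                    // overwrite current minimum
        return true;
    }
}
```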




[jira] Commented: (LUCENE-1604) Stop creating huge arrays to represent the absence of field norms

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703142#action_12703142
 ] 

Michael McCandless commented on LUCENE-1604:


OK patch looks good!  I plan to commit in a day or two.  Thanks Shon!




[jira] Commented: (LUCENE-1617) Add testpackage to common-build.xml

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703156#action_12703156
 ] 

Michael McCandless commented on LUCENE-1617:


I would like to be able to run them (one use case for this would be to 
parallelize tests -- I do this now (Python script) by running 
test-core/test-tag/test-contrib in parallel, but it's mis-balanced because 
contrib finishes so quickly).  How about -Dtestrootonly=true?

 Add testpackage to common-build.xml
 -

 Key: LUCENE-1617
 URL: https://issues.apache.org/jira/browse/LUCENE-1617
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Build
Reporter: Shai Erera
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1617.patch


 One can define testcase to execute just one test class, which is 
 convenient. However, I didn't notice any equivalent for testing a whole 
 package. I find it convenient to be able to test packages rather than test 
 cases because often it is not so clear which test class to run.
 Following patch allows one to ant test -Dtestpackage=search (for example) 
 and run all tests under the \*/search/\* packages in core, contrib and tags, 
 or do ant test-core -Dtestpackage=search and execute similarly just for 
 core, or do ant test-core -Dtestpackage=lucene/search/function and run all 
 the tests under \*/lucene/search/function/\* (just in case there is another 
 o.a.l.something.search.function package out there which we want to exclude).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703157#action_12703157
 ] 

Earwin Burrfoot commented on LUCENE-1616:
-

bq. removing separate looks a bit better for me as it forces users to think 
atomic about offset = {start, end}.
And if it's not atomic by design?

bq. If you separate start and end offset too far in your code, probability that 
you do not see mistake somewhere is higher compared to the case where you 
manage start and end on your own in these cases (this is then rather explicit 
in your code)...
Instead of having one field for Term, which you build incrementally, you now 
have to keep another field for startOffset. Imho, that's starting to cross into 
another meaning of 'explicit' :)
And while you're trying to prevent bugs of using setStartOffset and forgetting 
about its 'End' counterpart, you introduce another set of bugs - overwriting 
one end of interval, when you only need to update another.

bq. And in general I prefer one clear way to do something
And force everyone who has slightly different use-case to jump through the 
hoops. Span*Query api is a perfect example.

Well, whatever.
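For reference, the combined setter under debate looks roughly like this. This is a simplified, illustrative sketch — the real OffsetAttribute lives in org.apache.lucene.analysis.tokenattributes and carries the full attribute machinery, which is omitted here:

```java
// Simplified stand-in for OffsetAttribute showing the proposed atomic setter.
public class OffsetAttributeSketch {
    private int startOffset;
    private int endOffset;

    // the proposed setter: both ends of the interval in one call,
    // replacing separate setStartOffset(int) / setEndOffset(int)
    public void setOffset(int startOffset, int endOffset) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    public int startOffset() { return startOffset; }
    public int endOffset() { return endOffset; }

    public static void main(String[] args) {
        OffsetAttributeSketch att = new OffsetAttributeSketch();
        att.setOffset(3, 8);
        assert att.startOffset() == 3 && att.endOffset() == 8;
        System.out.println("ok");
    }
}
```

The trade-off discussed above is visible in the signature: both offsets must be known at the call site, which prevents a forgotten end offset but forces callers who build the two ends at different points to carry both values until the call.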

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread eks dev

Ok, I'll create another patch a bit later today


- Original Message 
 From: Michael McCandless (JIRA) j...@apache.org
 To: java-dev@lucene.apache.org
 Sent: Monday, 27 April, 2009 16:34:30
 Subject: [jira] Commented: (LUCENE-1616) add one setter for start and end 
 offset to OffsetAttribute
 
 
     [ 
 https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703144#action_12703144
  
 ] 
 
 Michael McCandless commented on LUCENE-1616:
 
 
 bq. removing separate looks a bit better for me as it forces users to think 
 attomic about offset = {start, end}.
 (sic: atomic)
 
 This is my thinking as well.
 
 And in general I prefer one clear way to do something (the Python way) instead 
 of providing various different ways to do the same thing (the Perl way).
 
  add one setter for start and end offset to OffsetAttribute
  --
 
                 Key: LUCENE-1616
                 URL: https://issues.apache.org/jira/browse/LUCENE-1616
             Project: Lucene - Java
           Issue Type: Improvement
           Components: Analysis
             Reporter: Eks Dev
             Priority: Trivial
             Fix For: 2.9
 
         Attachments: LUCENE-1616.patch
 
 
  add OffsetAttribute. setOffset(startOffset, endOffset);
  trivial change, no JUnit needed
  Changed CharTokenizer to use it
 
 -- 
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.
 
 
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1617) Add testpackage to common-build.xml

2009-04-27 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1617:
---

Attachment: LUCENE-1617.patch

Added another property testpackageroot. So now you can define:
* testcase - for a single test class
* testpackage - for all classes in a package, including sub-packages
* testpackageroot - for all classes in a package, without sub-packages

But something is strange ... if I run ant test-core it works ok. If I run 
ant test-core -Dtestpackage=lucene few classes fail, like AnalysisTest, 
IndexTest etc. (those that end with Test). That's because they are not 
TestCases ... I wonder why in ant test-core those files are skipped (and I 
see they are not executed) but in testpackage they are not.

Anyway, I'll look into it later, unless someone who is more knowledgeable in 
Ant wants to look at it.

This is not ready to be committed, as ant test-core -Dtestpackage=lucene and 
ant test-core -Dtestpackageroot=lucene fail on those non-test cases files.

 Add testpackage to common-build.xml
 -

 Key: LUCENE-1617
 URL: https://issues.apache.org/jira/browse/LUCENE-1617
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Build
Reporter: Shai Erera
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1617.patch, LUCENE-1617.patch


 One can define testcase to execute just one test class, which is 
 convenient. However, I didn't notice any equivalent for testing a whole 
 package. I find it convenient to be able to test packages rather than test 
 cases because often it is not so clear which test class to run.
 Following patch allows one to ant test -Dtestpackage=search (for example) 
 and run all tests under the \*/search/\* packages in core, contrib and tags, 
 or do ant test-core -Dtestpackage=search and execute similarly just for 
 core, or do ant test-core -Dtestpackage=lucene/search/function and run all 
 the tests under \*/lucene/search/function/\* (just in case there is another 
 o.a.l.something.search.function package out there which we want to exclude).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: CHANGES.txt

2009-04-27 Thread Steven A Rowe
Thank you, Mike, for working to make things better.

Steve

On 4/27/2009 at 5:32 AM, Michael McCandless wrote:
 OK I fixed CHANGES.txt to not have double entries for the same issue
 in 2.4.1 and trunk (ie, the entry is only in 2.4.1's CHANGES section).
 
 And going forward, if a trunk issue gets backported to a point
 release, we should de-dup the entries on releasing the point release.
 Ie, before the point release is released, trunk can contain XXX as
 well as the branch for the point release, but on copying back the
  branch's CHANGES entries, we de-dup them.  I'll update ReleaseTodo in
 the wiki.
 
 Thanks Steven!
 
 Mike
 
 On Sat, Apr 25, 2009 at 5:43 AM, Michael McCandless
 luc...@mikemccandless.com wrote:
  On Fri, Apr 24, 2009 at 7:17 PM, Steven A Rowe sar...@syr.edu
 wrote:
 
  Maybe even tiny bug fixes should always be called out on trunk's
  CHANGES.  Or, maybe a tiny bug fix that also gets backported to a
  point release, must then be called out in both places?  I think I
  prefer the 2nd.
 
  The difference between these two options is that in the 2nd, tiny
 bug fixes are mentioned in trunk's CHANGES only if they are backported
 to a point release, right?
 
  For the record, the previous policy (the zeroth option :) appears to
 be that backported bug fixes, regardless of size, are mentioned only
 once, in the CHANGES for the (chronologically) first release in which
 they appeared.  You appear to oppose this policy, because
 (paraphrasing): people would wonder whether point release fixes were
 also fixed on following major/minor releases.  IMNSHO, however, people
 (sometimes erroneously) view product releases as genetically linear:
 naming a release A.(B)[.x] implies inclusion of all changes to any
 release A.B[.y].  I.e., my sense is quite the opposite of yours: I
 would be *shocked* if bug fixes included in version 2.4.1 were not
 included (or explicitly called out as not included) in version 2.9.0.
 
  If more than one point release branch is active at any one time,
 then things get more complicated (genetic linearity can no longer be
 assumed), and your new policy seems like a reasonable attempt at
 managing the complexity.  But will Lucene ever have more than one
 active bugfix branch?  It never has before.
 
  But maybe I'm not understanding your intent: are you distinguishing
 between released CHANGES and unreleased CHANGES?  That is, do you
 intend to apply this new policy only to the unreleased trunk CHANGES,
 but then remove the redundant bugfix notices once a release is
 performed?
 
  OK you've convinced me (to go back to the 0th policy)!  Users can and
  should assume on seeing a point release containing XXX that all
 future
  releases also include XXX.  Ie, CHANGES should not be a vehicle for
  confirming that this is what happened.
 
   So if XXX is committed to trunk and got a CHANGES entry, and at a later 
  time it's backported to a point release, I will remove the XXX from 
  the trunk CHANGES and put it *only* in the point release's CHANGES.
 
  Also, I'll go and fix CHANGES, to remove the trunk entries when
  there's a point-release entry, if nobody objects in the next day or
  so.
 
  Mike


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703182#action_12703182
 ] 

Michael McCandless commented on LUCENE-1616:


bq. And force everyone who has slightly different use-case to jump through the 
hoops.

"Simple things should be simple and complex things should be possible" is a 
strong guide when I'm thinking about APIs, configuration, etc.

My feeling here is for the vast majority of the cases, people set start and end 
offset together, so we should shift to the API that makes that easy.  This is 
the simple case.

For the remaining minority (your interesting use case), you can still do what 
you need but yes there are some hoops to go through.  This is the complex 
case.

bq. Span*Query api is a perfect example.

Can you describe the limitations here in more detail?

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703197#action_12703197
 ] 

Michael McCandless commented on LUCENE-1593:


The patch still has various logic to handle the sentinel values, but we backed 
away from that optimization (it's not generally safe)?

Also, I fear we need to conditionalize the "don't need to break ties by docID" 
optimization, because BooleanScorer doesn't visit docs in order?

bq. I chose to discard that optimization, which only affects next() and 
skipTo().

Maybe we should add a start() method to Scorer, to handle initializations 
like this, so that next() doesn't have to check every time?

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: LUCENE-1593.patch, PerfTest.java


 This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
 to remove unnecessary checks. The plan is:
 # Ensure that IndexSearcher returns segments in increasing doc Id order, 
 instead of numDocs().
 # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
 will always have larger ids and therefore cannot compete.
 # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
 and remove the check if reusableSD == null.
 # Also move to use changing top and then call adjustTop(), in case we 
 update the queue.
 # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker 
 for the last SortField. But, doing so should not be necessary (since we 
 already break ties by docID), and is in fact less efficient (once the above 
 optimization is in).
 # Investigate PQ - can we deprecate insert() and have only 
 insertWithOverflow()? Add an addDummyObjects method which will populate the 
 queue without arranging it, just store the objects in the array (this can 
 be used to pre-populate sentinel values)?
 I will post a patch as well as some perf measurements as soon as I have them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: perf enhancement and lucene-1345

2009-04-27 Thread Michael McCandless
I think you mean this thread:

  http://markmail.org/message/idgcnxmbyo3yjdiw

right?

I would love to see these in Lucene... P4Delta, which Paul has started
under LUCENE-1410, is clearly a win, but is a biggish change to Lucene
since all offsets would need to become blockID + offsetWithinBlock.
LUCENE-1458 (further steps flexible indexing) tries to make things
generic enough that P4Delta can simply be a different codec.

On the logic operators for combining DocIDSets... how do these differ
from what we already do in BooleanScorer[2]?  (I haven't had a chance
to get a good look at Kamikaze yet).

Mike

On Fri, Apr 24, 2009 at 11:34 PM, John Wang john.w...@gmail.com wrote:
 Hi Guys:
      A while ago I posted some enhancements to disjunction and conjunction
 docIdSetIterators that showed performance improvements to Lucene-1345. I
 think it got mixed up with another discussion on that issue. Was wondering
 what happened with it and what are the plans.
 Thanks
 -John

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703227#action_12703227
 ] 

Michael McCandless commented on LUCENE-1616:


Thanks Eks.  You also need to fix all the places that call the old methods 
(things don't compile w/ the new patch).

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch, LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703228#action_12703228
 ] 

Michael McCandless commented on LUCENE-1614:


Shai did you forget to attach patch here?  Or maybe you're just busy ;)

 Add next() and skipTo() variants to DocIdSetIterator that return the current 
 doc, instead of boolean
 

 Key: LUCENE-1614
 URL: https://issues.apache.org/jira/browse/LUCENE-1614
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 See 
 http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html
  for the full discussion. The basic idea is to add variants to those two 
 methods that return the current doc they are at, to save successive calls to 
 doc(). If there are no more docs, return -1. A summary of what was discussed 
 so far:
 # Deprecate those two methods.
 # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI 
 (calls next() and skipTo() respectively, and will be changed to abstract in 
 3.0).
 #* I actually would like to propose an alternative to the names: advance() 
 and advance(int) - the first advances by one, the second advances to target.
 # Wherever these are used, do something like '(doc = advance()) >= 0' instead 
 of comparing to -1 for improved performance.
 I will post a patch shortly
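The proposed idiom can be sketched with a toy iterator. This is a hedged illustration, not the real DocIdSetIterator: the interface, the over() factory, and the class name are all made up for the example; only the '(doc = advance()) >= 0' loop shape and the -1 exhaustion convention come from the discussion above:

```java
// Toy stand-in for DocIdSetIterator: advance() returns the current doc id
// directly (or -1 when exhausted), saving the separate call to doc().
public class AdvanceSketch {
    public interface DocIterator {
        int advance();           // advance by one; return current doc or -1
        int advance(int target); // advance to first doc >= target, or -1
    }

    // toy iterator over a sorted array of doc ids
    public static DocIterator over(final int[] docs) {
        return new DocIterator() {
            int pos = -1;
            public int advance() {
                return ++pos < docs.length ? docs[pos] : -1;
            }
            public int advance(int target) {
                int doc;
                while ((doc = advance()) >= 0 && doc < target) { }
                return doc;
            }
        };
    }

    public static void main(String[] args) {
        DocIterator it = over(new int[] {2, 5, 9});
        int doc;
        StringBuilder seen = new StringBuilder();
        // the '(doc = advance()) >= 0' pattern from the issue summary
        while ((doc = it.advance()) >= 0) seen.append(doc).append(' ');
        assert seen.toString().equals("2 5 9 ");
        System.out.println("ok");
    }
}
```

Compared with the old boolean next()/skipTo() plus doc(), the loop does one virtual call per step instead of two.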

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Eks Dev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1616:


Attachment: LUCENE-1616.patch

whoops, this time it compiles :)

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-04-27 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703233#action_12703233
 ] 

Shai Erera commented on LUCENE-1614:


No I did not forget - I need to work on it (trying to juggle all the issues I 
opened :)) ... in general I don't like to work on overlapping issues and this 
overlaps with 1593 (it will touch some of the same files). But I can start 
working on the patch - it looks much simpler than 1593 ...

One thing I wanted to get feedback on is the proposal to use advance() and 
advance(target). Let's decide on that now, so that I don't need to refactor 
everything afterwards :)

 Add next() and skipTo() variants to DocIdSetIterator that return the current 
 doc, instead of boolean
 

 Key: LUCENE-1614
 URL: https://issues.apache.org/jira/browse/LUCENE-1614
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 See 
 http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html
  for the full discussion. The basic idea is to add variants to those two 
 methods that return the current doc they are at, to save successive calls to 
 doc(). If there are no more docs, return -1. A summary of what was discussed 
 so far:
 # Deprecate those two methods.
 # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI 
 (calls next() and skipTo() respectively, and will be changed to abstract in 
 3.0).
 #* I actually would like to propose an alternative to the names: advance() 
 and advance(int) - the first advances by one, the second advances to target.
 # Wherever these are used, do something like '(doc = advance()) >= 0' instead 
 of comparing to -1 for improved performance.
 I will post a patch shortly

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703245#action_12703245
 ] 

Michael McCandless commented on LUCENE-1616:


I still get compilation errors:
{code}
[mkdir] Created dir: /lucene/src/lucene.offsets/build/classes/java
[javac] Compiling 372 source files to 
/lucene/src/lucene.offsets/build/classes/java
[javac] 
/lucene/src/lucene.offsets/src/java/org/apache/lucene/analysis/KeywordTokenizer.java:62:
 cannot find symbol
[javac] symbol  : method setStartOffset(int)
[javac] location: class 
org.apache.lucene.analysis.tokenattributes.OffsetAttribute
[javac]   offsetAtt.setStartOffset(0);
[javac]^
[javac] 
/lucene/src/lucene.offsets/src/java/org/apache/lucene/analysis/KeywordTokenizer.java:63:
 cannot find symbol
[javac] symbol  : method setEndOffset(int)
[javac] location: class 
org.apache.lucene.analysis.tokenattributes.OffsetAttribute
[javac]   offsetAtt.setEndOffset(upto);
[javac]^
[javac] 
/lucene/src/lucene.offsets/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java:164:
 cannot find symbol
[javac] symbol  : method setStartOffset(int)
[javac] location: class 
org.apache.lucene.analysis.tokenattributes.OffsetAttribute
[javac] offsetAtt.setStartOffset(start);
[javac]  ^
[javac] 
/lucene/src/lucene.offsets/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java:165:
 cannot find symbol
[javac] symbol  : method setEndOffset(int)
[javac] location: class 
org.apache.lucene.analysis.tokenattributes.OffsetAttribute
[javac] offsetAtt.setEndOffset(start+termAtt.termLength());
[javac]  ^
[javac] 
/lucene/src/lucene.offsets/src/java/org/apache/lucene/index/DocInverterPerThread.java:56:
 cannot find symbol
[javac] symbol  : method setStartOffset(int)
[javac] location: class 
org.apache.lucene.analysis.tokenattributes.OffsetAttribute
[javac]   offsetAttribute.setStartOffset(startOffset);
[javac]  ^
[javac] 
/lucene/src/lucene.offsets/src/java/org/apache/lucene/index/DocInverterPerThread.java:57:
 cannot find symbol
[javac] symbol  : method setEndOffset(int)
[javac] location: class 
org.apache.lucene.analysis.tokenattributes.OffsetAttribute
[javac]   offsetAttribute.setEndOffset(endOffset);
[javac]  ^
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] 6 errors
{code}

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-04-27 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703251#action_12703251
 ] 

Marvin Humphrey commented on LUCENE-1614:
-

 advance() and advance(int)

In the interest of coherent email exchanges, I think it would be best to give
these methods distinct names, e.g. nudge and advance.


 Add next() and skipTo() variants to DocIdSetIterator that return the current 
 doc, instead of boolean
 

 Key: LUCENE-1614
 URL: https://issues.apache.org/jira/browse/LUCENE-1614
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 See 
 http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html
  for the full discussion. The basic idea is to add variants to those two 
 methods that return the current doc they are at, to save successive calls to 
 doc(). If there are no more docs, return -1. A summary of what was discussed 
 so far:
 # Deprecate those two methods.
 # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI 
 (calls next() and skipTo() respectively, and will be changed to abstract in 
 3.0).
 #* I actually would like to propose an alternative to the names: advance() 
 and advance(int) - the first advances by one, the second advances to target.
 # Wherever these are used, do something like '(doc = advance()) >= 0' instead 
 of comparing to -1 for improved performance.
 I will post a patch shortly

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703254#action_12703254
 ] 

Eks Dev commented on LUCENE-1616:
-

me too, sorry! 
Eclipse left me blind for some funny reason
waiting for test to complete before I commit again ... 

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-27 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703266#action_12703266
 ] 

Shai Erera commented on LUCENE-1593:


bq. The patch still has various logic to handle the sentinel values

Are you talking about TSDC? I thought we agreed that initializing to 
Float.NEG_INF is reasonable for TSDC? If not, then I can remove it from there 
as well as the changes done to PQ.

bq. Maybe we should add a start() method to Scorer

Could be useful - but then we should probably do it on DocIdSetIterator with 
default impl, and override where it makes sense (BS and BS2)? Also, if we do 
this, why not add an end() too, allowing a DISI to release resources?
And if we document that calling next() and skipTo() without start() first 
may result in unspecified behavior, it will somewhat resemble 
TermPositions, where you have to call next() before anything else.

However, this should be done with caution. BS2 calls initCountingSumScorer in 
two places: (1) next() and skipTo() and (2) score(Collector). Now, in the 
latter, it first checks if allowDocsOutOfOrder and if so initializes BS, with 
adding the Scorers that were added in add(). However those Scorers *must not be 
initialized* prior to creating BS, since they will be advanced.
So now it gets tricky - upon call to start(), what should BS2 do? Check 
allowDocsOutOfOrder to determine if to initialize or not? And what if it is 
true but score(Collector) will not be called, and instead next() and skipTo()?
We should also protect against calling start() more than once, and in Scorers 
that aggregate several scorers, we should make sure their start() is called 
after all Scorers were added ... gets a bit complicated. What do you think?

bq. Also, I fear we need to conditionalize the "don't need to break ties by 
docID", because BooleanScorer doesn't visit docs in order?

Yes I kept BS and BS2 in mind ... if we conditionalize anything, it means extra 
'if'. If we want to avoid that 'if', we need to create a variant of the class, 
which might not be so bad in TSDC, but will look awful in TFC (additional 6(?) 
classes).
Perhaps we should still attempt to add to PQ if cmp == 0?
Or did you have something else in mind?
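For what it's worth, the sentinel idea can be sketched roughly as follows. This is a hypothetical simplification (SentinelHitQueue, insertWithOverflow, and competitiveHits are illustrative names, not the actual Lucene HitQueue/PriorityQueue API): the queue starts pre-filled with entries scoring Float.NEGATIVE_INFINITY, so the hot insert path never needs the reusableSD == null check.

```java
import java.util.Arrays;

// Rough sketch of the sentinel pre-fill idea (hypothetical names, not
// the real HitQueue): the queue starts "full" of entries that lose every
// comparison, so inserting needs no null/size checks.
final class SentinelHitQueue {
    private final float[] scores;

    SentinelHitQueue(int size) {
        scores = new float[size];
        Arrays.fill(scores, Float.NEGATIVE_INFINITY); // sentinel values
    }

    // Returns true if the score displaced a sentinel or a weaker hit.
    // A real implementation keeps a heap; a linear scan keeps this short.
    boolean insertWithOverflow(float score) {
        int min = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] < scores[min]) min = i;
        }
        if (score <= scores[min]) return false; // cannot compete
        scores[min] = score;
        return true;
    }

    // How many real (non-sentinel) hits the queue currently holds.
    int competitiveHits() {
        int n = 0;
        for (float s : scores) if (s > Float.NEGATIVE_INFINITY) n++;
        return n;
    }
}
```

Since a new hit only ever replaces the current minimum, a doc that merely ties the minimum is rejected, which is consistent with not breaking ties by docID.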

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: LUCENE-1593.patch, PerfTest.java


 This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
 to remove unnecessary checks. The plan is:
 # Ensure that IndexSearcher returns segments in increasing doc Id order, 
 instead of numDocs().
 # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
 will always have larger ids and therefore cannot compete.
 # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
 and remove the check if reusableSD == null.
 # Also move to use changing top and then call adjustTop(), in case we 
 update the queue.
 # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker 
 for the last SortField. But, doing so should not be necessary (since we 
 already break ties by docID), and is in fact less efficient (once the above 
 optimization is in).
 # Investigate PQ - can we deprecate insert() and have only 
 insertWithOverflow()? Add a addDummyObjects method which will populate the 
 queue without arranging it, just store the objects in the array (this can 
 be used to pre-populate sentinel values)?
 I will post a patch as well as some perf measurements as soon as I have them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703288#action_12703288
 ] 

Earwin Burrfoot commented on LUCENE-1616:
-

bq.  Span*Query api is a perfect example.
bq. Can you describe the limitations here in more detail?
Take a look at SpanNearQuery and SpanOrQuery.

1. They don't provide incremental construction (i.e. an add() method, like in 
BooleanQuery), and they can be built only from an array of subqueries. So, if 
you don't know the exact number of subqueries upfront, you're busted. You have 
to use an ArrayList, which you convert to an array to feed into the SpanQuery, 
which is converted back to an ArrayList inside!!
2. They can't be edited. If you have a need to iterate over your query tree and 
modify it in one way or another, you need to create brand new instances of 
Span*Query. And here you hit #1 again, hard.
3. They can't even be inspected without creating a new array from the backing 
list (see getClauses).

I use patched versions of SpanNear/OrQueries, which still use a backing 
ArrayList, but accept it in the constructor, have a utility 'add' method, and 
whose getClauses() returns this very list, which allows zero-cost inspection 
and easy modification if the need arises.
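A minimal sketch of that list-backed approach (hypothetical class, with String clauses standing in for SpanQuery objects; not the actual Lucene API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the list-backed style described above: incremental add(),
// and getClauses() exposing the backing list itself for zero-cost
// inspection. Clauses are Strings here purely for illustration.
final class SpanClauseList {
    private final List<String> clauses;

    SpanClauseList(List<String> clauses) {
        this.clauses = clauses; // adopt the caller's list, no copy
    }

    SpanClauseList add(String clause) {
        clauses.add(clause);
        return this; // chainable, like BooleanQuery.add
    }

    List<String> getClauses() {
        return clauses; // the very list, not a defensive array copy
    }
}
```

The trade-off of returning the backing list is that callers can mutate the query's clauses directly, which is exactly the editability being asked for here.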

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it




[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-04-27 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703291#action_12703291
 ] 

Shai Erera commented on LUCENE-1614:


'nudge' doesn't sound like it changes anything, it just touches. So if 
distinct method names are what we're after, I prefer nextDoc() and skipToDoc(), 
or advance() for the latter.
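For context, the iteration style the issue proposes can be sketched with a toy iterator (hypothetical class, not the actual DocIdSetIterator):

```java
// Toy iterator illustrating the proposed contract: nextDoc()/advance()
// return the doc id they land on, or -1 when the iterator is exhausted,
// so callers fold the advance and the doc() read into a single call.
final class IntArrayDocIdIterator {
    private final int[] docs; // ascending doc ids
    private int pos = -1;

    IntArrayDocIdIterator(int... docs) {
        this.docs = docs;
    }

    int nextDoc() {
        return ++pos < docs.length ? docs[pos] : -1;
    }

    int advance(int target) {
        int doc;
        while ((doc = nextDoc()) != -1 && doc < target) {
            // keep skipping until we reach or pass target
        }
        return doc;
    }
}
```

Callers then loop with `while ((doc = it.nextDoc()) >= 0) { ... }` instead of pairing next() with a separate doc() call.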

 Add next() and skipTo() variants to DocIdSetIterator that return the current 
 doc, instead of boolean
 

 Key: LUCENE-1614
 URL: https://issues.apache.org/jira/browse/LUCENE-1614
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 See 
 http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html
  for the full discussion. The basic idea is to add variants to those two 
 methods that return the current doc they are at, to save successive calls to 
 doc(). If there are no more docs, return -1. A summary of what was discussed 
 so far:
 # Deprecate those two methods.
 # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI 
 (calls next() and skipTo() respectively, and will be changed to abstract in 
 3.0).
 #* I actually would like to propose an alternative to the names: advance() 
 and advance(int) - the first advances by one, the second advances to target.
 # Wherever these are used, do something like '(doc = advance()) >= 0' instead 
 of comparing to -1 for improved performance.
 I will post a patch shortly




[jira] Commented: (LUCENE-1313) Realtime Search

2009-04-27 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703315#action_12703315
 ] 

Jason Rutherglen commented on LUCENE-1313:
--

{quote} When we create SegmentWriteState (which is supposed to
contain all details needed to tell DW how/where to write the
segment), we'd set its directory to the RAMDir? That ought to be
all that's needed (though, it's possible some places use a
private copy of the original directory, which we should fix). DW
should care less which Directory the segment is written to...
{quote}

Agreed that DW can write the segment to the RAMDir. I started
coding along these lines; however, what do we do about the RAMDir
merging? This is why I was thinking we'll need a separate IW?
Otherwise the ram segments (if they are treated the same as disk
segments) would quickly be merged to disk? Or do we have two
separate merging paths?

If we have a disk IW and a ram IW, I'm not sure how the
docstores-to-disk part would work, though I'm sure there's some
way to do it.

bq. modify resolveExternalSegments to accept a doMerge?

Sounds good.

 Realtime Search
 ---

 Key: LUCENE-1313
 URL: https://issues.apache.org/jira/browse/LUCENE-1313
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, 
 lucene-1313.patch, lucene-1313.patch


 Realtime search with transactional semantics.  
 Possible future directions:
   * Optimistic concurrency
   * Replication
 Encoding each transaction into a set of bytes by writing to a RAMDirectory 
 enables replication.  It is difficult to replicate using other methods 
 because while the document may easily be serialized, the analyzer cannot.
 I think this issue can hold realtime benchmarks which include indexing and 
 searching concurrently.




[jira] Commented: (LUCENE-1313) Realtime Search

2009-04-27 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703317#action_12703317
 ] 

Jason Rutherglen commented on LUCENE-1313:
--

{quote}we should make with NRT is to not close the doc store
(stored fields, term vector) files when flushing for an NRT
reader. {quote}

Agreed, I think this feature is a must; otherwise we're doing
unnecessary in-RAM merging.

{quote}we'd need to be able to somehow share an IndexInput &
IndexOutput; or, perhaps we can open an IndexInput even though
an IndexOutput{quote}

I ran into problems with this before, when I was trying to reuse
a Directory to write a transaction log. It seemed theoretically
doable, however it didn't work in practice. It could have been
the seeking and replacing, but I don't remember. FSIndexOutput
uses a writeable RAF and FSIndexInput is read-only, so why would
there be an issue?




 Realtime Search
 ---

 Key: LUCENE-1313
 URL: https://issues.apache.org/jira/browse/LUCENE-1313
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, 
 lucene-1313.patch, lucene-1313.patch


 Realtime search with transactional semantics.  
 Possible future directions:
   * Optimistic concurrency
   * Replication
 Encoding each transaction into a set of bytes by writing to a RAMDirectory 
 enables replication.  It is difficult to replicate using other methods 
 because while the document may easily be serialized, the analyzer cannot.
 I think this issue can hold realtime benchmarks which include indexing and 
 searching concurrently.




[jira] Commented: (LUCENE-1313) Realtime Search

2009-04-27 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703327#action_12703327
 ] 

Jason Rutherglen commented on LUCENE-1313:
--

{quote}doc store files punch straight through to the real
directory{quote}

To implement this functionality in parallel (and perhaps make
the overall patch cleaner), could writing doc stores directly to
a separate directory be a different patch? There could be an
option IW.setDocStoresDirectory(Directory) that the patch
implements, then some unit tests that are separate from the near
realtime portion.

 Realtime Search
 ---

 Key: LUCENE-1313
 URL: https://issues.apache.org/jira/browse/LUCENE-1313
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, 
 lucene-1313.patch, lucene-1313.patch


 Realtime search with transactional semantics.  
 Possible future directions:
   * Optimistic concurrency
   * Replication
 Encoding each transaction into a set of bytes by writing to a RAMDirectory 
 enables replication.  It is difficult to replicate using other methods 
 because while the document may easily be serialized, the analyzer cannot.
 I think this issue can hold realtime benchmarks which include indexing and 
 searching concurrently.




[jira] Updated: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Eks Dev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1616:


Attachment: LUCENE-1616.patch

OK, maybe this time it will work; I hope I managed to clean it up (core build 
and tests pass).

The only thing that fails is contrib, but I guess that has nothing to do with 
this patch?


[javac] 
D:\Repository\SerachAndMatch\Lucene\lucene\java\trunk\contrib\highlighter\src\java\org\apache\lucene\search\highlight\WeightedSpanTermExtractor.java:306:
 cannot find symbol
[javac]   MemoryIndex indexer = new MemoryIndex();
[javac]   ^
[javac]   symbol:   class MemoryIndex
[javac]   location: class 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor
[javac] 
D:\Repository\SerachAndMatch\Lucene\lucene\java\trunk\contrib\highlighter\src\java\org\apache\lucene\search\highlight\WeightedSpanTermExtractor.java:306:
 cannot find symbol
[javac]   MemoryIndex indexer = new MemoryIndex();
[javac] ^
[javac]   symbol:   class MemoryIndex
[javac]   location: class 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 3 errors

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch, 
 LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it




[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703329#action_12703329
 ] 

Michael McCandless commented on LUCENE-1593:


bq. I thought we agreed that initializing to Float.NEG_INF is reasonable for 
TSDC? 

Woops, sorry, you're right.  I'm just losing my mind.

I think the javadoc for PriorityQueue.addSentinelObjects should state
that the Objects must all be logically equal? I.e., we do a straight
copy into the pqueue, so if they are not equal then the pqueue is in a
messed-up state.

Actually that method is somewhat awkward.  I wonder if, instead, we
could define an Object getSentinelObject(), returning null by default,
and the pqueue on creation would call that and if it's non-null, fill
the queue (by calling it maxSize times)?
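That alternative could look roughly like this (a hypothetical, much simplified queue; not the real Lucene PriorityQueue):

```java
// Sketch of the getSentinelObject() hook: the queue asks for a sentinel
// at construction time and, if one is supplied, fills itself by calling
// the hook maxSize times. Returning null (the default) means no pre-fill.
// Note this deliberately calls an overridable method from the
// constructor, as the proposal implies; subclasses must keep the hook
// independent of their own instance state.
class SimplePQ<T> {
    protected final Object[] heap;
    protected int size;

    SimplePQ(int maxSize) {
        heap = new Object[maxSize];
        if (getSentinelObject() != null) {
            for (int i = 0; i < maxSize; i++) {
                heap[i] = getSentinelObject(); // all logically equal
            }
            size = maxSize; // starts logically full of equal losers
        }
    }

    /** Subclasses override to enable pre-fill; null disables it. */
    protected T getSentinelObject() {
        return null;
    }

    int size() {
        return size;
    }
}
```

With the null default, existing subclasses are unaffected, while a TSDC-style queue just overrides the hook to return its Float.NEGATIVE_INFINITY entry.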

bq. Could be useful - but then we should probably do it on DocIdSetIterator 
with default impl, and override where it makes sense (BS and BS2)? Also, if we 
do this, why not adding an end() too, allowing a DISI to release resources?

Actually, shouldn't Weight.scorer(...) in general be the place
where such pre-next() initialization is done? E.g.
BooleanWeight.scorer(...) should call BS2's initCountingSumScorer
(and/or somehow forward to BS)?

bq. Yes I kept BS and BS2 in mind ... if we conditionalize anything, it means 
extra 'if'. If we want to avoid that 'if', we need to create a variant of the 
class, which might not be so bad in TSDC, but will look awful in TFC 
(additional 6 classes).

Yeah, that (the 2x splintering) is what I was fearing. At some
point we should leave this splintering to source code
specialization... it's getting somewhat crazy now.

bq. Perhaps we should still attempt to add to PQ if cmp == 0?

That basically undoes the 'don't fall back to docID' optimization,
right?

bq. Or did you have something else in mind?

The 6 new classes are what I feared we'd need. Else, with the
changes here (never breaking ties by docID), TopFieldCollector
can't be used with BooleanScorer (which breaks back compat).

I guess since the 6 classes are hidden under
TopFieldCollector.create, it's maybe not so bad?


 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: LUCENE-1593.patch, PerfTest.java


 This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
 to remove unnecessary checks. The plan is:
 # Ensure that IndexSearcher returns segments in increasing doc Id order, 
 instead of numDocs().
 # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
 will always have larger ids and therefore cannot compete.
 # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
 and remove the check if reusableSD == null.
 # Also move to use changing top and then call adjustTop(), in case we 
 update the queue.
 # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker 
 for the last SortField. But, doing so should not be necessary (since we 
 already break ties by docID), and is in fact less efficient (once the above 
 optimization is in).
 # Investigate PQ - can we deprecate insert() and have only 
 insertWithOverflow()? Add a addDummyObjects method which will populate the 
 queue without arranging it, just store the objects in the array (this can 
 be used to pre-populate sentinel values)?
 I will post a patch as well as some perf measurements as soon as I have them.




[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703332#action_12703332
 ] 

Mark Miller commented on LUCENE-1616:
-

bq. The only thing that fails is contrib, but I guess this has nothing to do 
with it?

Looks like an issue with the highlighter's dependency on MemoryIndex. What 
target produces the problem? We have seen something like it in the past.

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch, 
 LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it




[jira] Issue Comment Edited: (LUCENE-1613) TermEnum.docFreq() is not updated with there are deletes

2009-04-27 Thread Matt Chaput (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1270#action_1270
 ] 

Matt Chaput edited comment on LUCENE-1613 at 4/27/09 1:11 PM:
--

Given how fundamental the issue is w.r.t. how Lucene stores the index, it's 
unlikely ever to be fixed. (A clean, performant fix other than simply merging 
the segments would be a pretty incredible revelation.) As an outside observer I 
would argue against keeping the bug open forever for correctness' sake.



  was (Author: mchaput):
Given how fundamental the issue is w.r.t. how Lucene stores the index, it's 
unlikely to ever be fixed. (A clean, performant fix other than simply merging 
the segments would be pretty incredible revelation.) As an outside observer I 
would argue against keeping the bug open forever for correctness sake.


  
 TermEnum.docFreq() is not updated with there are deletes
 

 Key: LUCENE-1613
 URL: https://issues.apache.org/jira/browse/LUCENE-1613
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.4
Reporter: John Wang
 Attachments: TestDeleteAndDocFreq.java


 TermEnum.docFreq is used in many places, especially scoring. However, if 
 there are deletes in the index and it is not yet merged, this value is not 
 updated.
 Attached is a test case.




[jira] Commented: (LUCENE-1613) TermEnum.docFreq() is not updated with there are deletes

2009-04-27 Thread Matt Chaput (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1270#action_1270
 ] 

Matt Chaput commented on LUCENE-1613:
-

Given how fundamental the issue is w.r.t. how Lucene stores the index, it's 
unlikely to ever be fixed. (A clean, performant fix other than simply merging 
the segments would be pretty incredible revelation.) As an outside observer I 
would argue against keeping the bug open forever for correctness sake.



 TermEnum.docFreq() is not updated with there are deletes
 

 Key: LUCENE-1613
 URL: https://issues.apache.org/jira/browse/LUCENE-1613
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.4
Reporter: John Wang
 Attachments: TestDeleteAndDocFreq.java


 TermEnum.docFreq is used in many places, especially scoring. However, if 
 there are deletes in the index and it is not yet merged, this value is not 
 updated.
 Attached is a test case.




[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703337#action_12703337
 ] 

Michael McCandless commented on LUCENE-1616:


bq. I use patched versions of SpanNear/OrQueries, which still use backing 
ArrayList, but accept it in constructor, have utility 'add' method and 
getClauses() returns this very list, which allows for zero-cost inspection and 
easy modification if the need arises.

That sounds useful -- is it something you can share?

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch, 
 LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it




[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703335#action_12703335
 ] 

Eks Dev commented on LUCENE-1616:
-

ant build-contrib 

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch, 
 LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it




[jira] Commented: (LUCENE-1613) TermEnum.docFreq() is not updated with there are deletes

2009-04-27 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703338#action_12703338
 ] 

Mark Miller commented on LUCENE-1613:
-

This is a dupe I believe, but for the life of me, I cannot find the original to 
link them.

 TermEnum.docFreq() is not updated with there are deletes
 

 Key: LUCENE-1613
 URL: https://issues.apache.org/jira/browse/LUCENE-1613
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.4
Reporter: John Wang
 Attachments: TestDeleteAndDocFreq.java


 TermEnum.docFreq is used in many places, especially scoring. However, if 
 there are deletes in the index and it is not yet merged, this value is not 
 updated.
 Attached is a test case.




[jira] Created: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-27 Thread Jason Rutherglen (JIRA)
Allow setting the IndexWriter docstore to be a different directory
--

 Key: LUCENE-1618
 URL: https://issues.apache.org/jira/browse/LUCENE-1618
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9


Add an IndexWriter.setDocStoreDirectory method that allows doc
stores to be placed in a different directory than the IW default
dir.




[jira] Assigned: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1616:
--

Assignee: Michael McCandless

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch, 
 LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it




[jira] Updated: (LUCENE-1619) TermAttribute.termLength() optimization

2009-04-27 Thread Eks Dev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1619:


Attachment: LUCENE-1619.patch

 TermAttribute.termLength() optimization
 ---

 Key: LUCENE-1619
 URL: https://issues.apache.org/jira/browse/LUCENE-1619
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Attachments: LUCENE-1619.patch


public int termLength() {
  initTermBuffer(); // This patch removes this method call 
  return termLength;
}
 I see no reason to initTermBuffer() in termLength()... all tests pass, but I 
 could be wrong?




[jira] Commented: (LUCENE-1617) Add testpackage to common-build.xml

2009-04-27 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703341#action_12703341
 ] 

Shai Erera commented on LUCENE-1617:


OK, so I've done some research, and I'm really puzzled. Everywhere I read, it 
is mentioned that batchtest uses a fileset to include test cases, and that 
you should include them using a pattern like **/Test*.java ... which is what 
is done already if none of the special test modes is specified (a single 
test, a package, or package-root).

However, for some reason if the definition looks like this, those non-TestCase 
classes are filtered out / skipped:
{code}
<fileset dir="src/test" includes="**/Test*.java,**/*Test.java" />
{code}

But if the definition looks like this, they are executed, which results in a 
failure:
{code}
<fileset dir="src/test" includes="**/lucene/Test*.java,**/lucene/*Test.java" />
{code}

As if the batchtest task behaves differently when the includes definition 
contains a different pattern than the first one. I also tried to modify the 
dir attribute to point at src/test/org/apache/lucene, but that doesn't seem 
to solve the problem.

So the only thing I can think of is to rename those classes so they don't 
start/end with Test? I'd hate to lose the ability to test an entire package 
just because of that limitation. By running ant test-core -Dtestpackage=lucene 
I can discover all the non-test classes that start/end with Test.

What do you think?

 Add testpackage to common-build.xml
 -

 Key: LUCENE-1617
 URL: https://issues.apache.org/jira/browse/LUCENE-1617
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Build
Reporter: Shai Erera
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1617.patch, LUCENE-1617.patch


 One can define testcase to execute just one test class, which is 
 convenient. However, I didn't notice any equivalent for testing a whole 
 package. I find it convenient to be able to test packages rather than test 
 cases because often it is not so clear which test class to run.
 Following patch allows one to ant test -Dtestpackage=search (for example) 
 and run all tests under the \*/search/\* packages in core, contrib and tags, 
 or do ant test-core -Dtestpackage=search and execute similarly just for 
 core, or do ant test-core -Dtestpackage=lucene/search/function and run all 
 the tests under \*/lucene/search/function/\* (just in case there is another 
 o.a.l.something.search.function package out there which we want to exclude.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1619) TermAttribute.termLength() optimization

2009-04-27 Thread Eks Dev (JIRA)
TermAttribute.termLength() optimization
---

 Key: LUCENE-1619
 URL: https://issues.apache.org/jira/browse/LUCENE-1619
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Priority: Trivial
 Attachments: LUCENE-1619.patch


{code}
public int termLength() {
  initTermBuffer(); // this patch removes this method call
  return termLength;
}
{code}

I see no reason to call initTermBuffer() in termLength()... all tests pass, but I 
could be wrong?
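For illustration, here is a self-contained sketch (a hypothetical stand-in, not Lucene's actual TermAttribute) of why termLength() need not touch the buffer: the length is tracked independently by the setters, and only termBuffer() genuinely needs the lazy copy.

```java
// Hypothetical, simplified stand-in for TermAttribute's lazy buffer handling.
public class LazyTermAttribute {
    private char[] termBuffer;          // allocated on demand
    private String termText = "hello";  // legacy String-based storage
    private int termLength = 5;         // kept up to date independently of the buffer

    // Copies termText into termBuffer on first access.
    private void initTermBuffer() {
        if (termBuffer == null) {
            termBuffer = termText.toCharArray();
        }
    }

    public char[] termBuffer() {
        initTermBuffer();  // buffer access genuinely needs the copy
        return termBuffer;
    }

    public int termLength() {
        // termLength is maintained by the setters, so no initTermBuffer()
        // call is needed here -- the point of the patch.
        return termLength;
    }
}
```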






[jira] Commented: (LUCENE-1617) Add testpackage to common-build.xml

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703346#action_12703346
 ] 

Michael McCandless commented on LUCENE-1617:


bq. So the only thing I can think of is to rename those classes to not 
start/end with Test?

I think this is an OK workaround for the ant spookiness?  (We could also ask 
our resident ant expert to figure it out ;) )

I think these classes are quite old and probably never used by anyone anymore.





[jira] Commented: (LUCENE-1617) Add testpackage to common-build.xml

2009-04-27 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703348#action_12703348
 ] 

Shai Erera commented on LUCENE-1617:


bq. I think these classes are quite old and probably never used by anyone 
anymore.

Then perhaps I just delete them? :D

If that's not acceptable, I'll run all the tests in core and contrib and rename 
those that fail. But deleting them really tickles the tip of my fingers !





[jira] Commented: (LUCENE-1617) Add testpackage to common-build.xml

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703356#action_12703356
 ] 

Michael McCandless commented on LUCENE-1617:


Actually I think deleting them is a good idea!

Does anyone object?





Re: new TokenStream api Question

2009-04-27 Thread eks dev

Should I create a patch with something like this? 

With expert javadoc and an explanation of what this is good for, it should be a 
nice addition to the Attribute use cases.
Practically, it would enable specialization of hard-linked Attributes like 
TermAttribute.

The only preconditions are: 

- the specialized Attribute must extend one of the hard-linked ones, and 
provide its class
- it must implement a default constructor 
- it should extend without introducing new state (the big majority of cases), so 
as not to break captureState()

The last one could be relaxed, I guess, but I am not yet 100% familiar with this 
code.

The use cases for this are along the lines of my example: smaller, easier user 
code and better performance (mainly in token filters).
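Uwe's suggestion below (register a subclass instance under the base class key) can be sketched with plain collections; the class names here are hypothetical stand-ins, not Lucene's API:

```java
import java.util.HashMap;
import java.util.Map;

public class AttributeRegistrySketch {
    // Minimal stand-in for Lucene's TermAttribute.
    public static class TermAttribute {
        public String term() { return "base"; }
    }

    // Specialization: adds behavior but no new state (so captureState()-style
    // copying would not be affected).
    public static class PrefixTermAttribute extends TermAttribute {
        @Override public String term() { return "-foo"; }
        public boolean startsWith(char c) {
            String t = term();
            return t.length() > 0 && t.charAt(0) == c;
        }
    }

    public static void main(String[] args) {
        Map<Class<?>, Object> attributes = new HashMap<>();

        // Register the subclass instance under the base class key, so a
        // consumer asking for TermAttribute.class still gets a valid instance.
        PrefixTermAttribute termAtt = new PrefixTermAttribute();
        attributes.put(TermAttribute.class, termAtt);

        // Consumer side: the plain TermAttribute cast succeeds ...
        TermAttribute forIndexer = (TermAttribute) attributes.get(TermAttribute.class);
        // ... while the producer keeps the richer interface.
        System.out.println(forIndexer.term() + " startsWith('-') = " + termAtt.startsWith('-'));
    }
}
```

Registering an unrelated class under TermAttribute.class, by contrast, would fail exactly as described below: the consumer's cast to TermAttribute throws a ClassCastException.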



- Original Message 
 From: Uwe Schindler u...@thetaphi.de
 To: java-dev@lucene.apache.org
 Sent: Sunday, 26 April, 2009 23:03:06
 Subject: RE: new TokenStream api Question
 
 There is one problem: if you extend TermAttribute, the class is different
 (which is the key in the attributes list). So when you initialize the
 TokenStream and do a
 
 YourClass termAtt = (YourClass) addAttribute(YourClass.class)
 
 ...you create a new attribute. So one possibility would be to also specify
 the instance and save the attribute by class (as key), but with your
 instance. If you are the first one to create the attribute (if it is a
 token stream and not a filter, you will be the first, adding the
 attribute in the ctor), everything is ok. Register the attribute yourself
 (maybe we should add a specialized addAttribute that can take an instance
 as the default):
 
 YourClass termAtt = new YourClass();
 attributes.put(TermAttribute.class, termAtt);
 
 In this case, for the indexer it is a standard TermAttribute, but you can
 do more with it.
 
 Replacing TermAttribute with a class of your own is not possible, as the indexer
 will get a ClassCastException when using the instance retrieved with
 getAttribute(TermAttribute.class).
 
 Uwe
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 
  -Original Message-
  From: eks dev [mailto:eks...@yahoo.co.uk]
  Sent: Sunday, April 26, 2009 10:39 PM
  To: java-dev@lucene.apache.org
  Subject: new TokenStream api Question
  
  
  I am just looking into new TermAttribute usage and wonder what would be
  the best way to implement PrefixFilter that would filter out some Terms
  that have some prefix,
  
  something like this, where '-' represents my prefix:
  
    public final boolean incrementToken() throws IOException {
      // return the first token we find that survives the filter
      while (input.incrementToken()) {
        int len = termAtt.termLength();
        if (len > 0 && termAtt.termBuffer()[0] != '-') // only length > 0 and no '-' prefix
          return true;
        // note: else we ignore it
      }
      // reached EOS
      return false;
    }
  
  
  
  
  
  The question would be:
  
  can I extend TermAttribute and add boolean startsWith(char c);
  
  The point is speed and my code gets smaller.
  TermAttribute makes one method call in termLength() and termBuffer() that I do
  not understand (back-compatibility, I guess):
    public int termLength() {
      initTermBuffer(); // I'd like to avoid it...
      return termLength;
    }
  
  
  I'd like to get rid of initTermBuffer(). The first option is to *extend*
  TermAttribute (but its fields are private, so no help there); or can I
  implement my own MyTermAttribute (will the indexer know how to deal with it?)
  
  Must I extend TermAttribute, or can I add my own?
  
  thanks,
  eks
  
  
  
  



[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-27 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703362#action_12703362
 ] 

Shai Erera commented on LUCENE-1593:


bq. I wonder if, instead, we could define an Object getSentinelObject(), 
returning null by default, and the pqueue on creation would call that and if 
it's non-null, fill the queue (by calling it maxSize times)?

Some extensions of PQ may not know how to construct such a sentinel object. 
Consider ComparablePQ, which assumes all items are Comparable. Unlike HitQueue, 
it does not know what items will be stored in the queue. But ... I guess it 
can return a Comparable which always prefers the other element ... so maybe 
that's not a good example.
I just have the feeling that a setter method would give us more freedom, rather 
than having to extend PQ just for that ... 

bq. Actually shouldn't Weight.scorer(...) in general be the place where 
such pre-next() initialization is done?

Ok - so BS's add() is only called from BS2.score(Collector). Therefore BS can 
be initialized from BS2 directly. Since both are package-private, we should be 
safe.
BS2's add() is called from BooleanWeight.scorer() (I'm sorry if I repeat what 
you wrote above, but that's just me learning the code), and so can be 
initialized from there ... hmm I wonder why this wasn't done so far?

I'll give it a try.

bq. That basically undoes the don't fallback to docID optimization right?

Right ... it's too late for me :)

bq. I guess since the 6 classes are hidden under the TopFieldCollector.create 
it's maybe not so bad?

It's just that maintaining that class becomes more and more problematic. It 
already contains 6 inner classes, which duplicate the code to avoid 'if' 
statements. Meaning every time a bug is found, all 6 need to be checked and 
fixed. With that proposal, it means 12 ...

But I wonder from where we would control it ... IndexSearcher no longer has a 
ctor which allows one to define whether docs should be collected in order or not 
(the patch removes it). The only other place where it's defined is in 
BooleanQuery's static setter (which affects all boolean queries). But BooleanWeight 
receives the Collector, and does not create it ...
So, if we check in IndexSearcher's search() methods whether this parameter is 
set or not, we can control the creation of TSDC and TFC. And if anyone else 
instantiates them on his own, he should know whether he executes searches 
in-order or not. Back-compat-wise, TFC and TSDC are still in trunk and haven't 
been released, so we shouldn't have a problem right?

Does that sound like a good approach? I still hate to duplicate the code in 
TFC, but I don't think there's any other choice. Maybe create completely 
separate classes for TFC and TSDC? although that will make code maintenance 
even harder.
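The sentinel pre-fill idea discussed above can be sketched with a tiny bounded collector (a heavily simplified stand-in for HitQueue, not Lucene's code; a linear scan replaces the real heap for brevity):

```java
import java.util.Arrays;

// Simplified stand-in for a sentinel-filled HitQueue: slots start at
// Float.NEGATIVE_INFINITY, so insertion never needs a null/emptiness check.
public class SentinelQueueSketch {
    private final float[] slots;

    public SentinelQueueSketch(int maxSize) {
        slots = new float[maxSize];
        // Sentinels lose every comparison, so real scores displace them naturally.
        Arrays.fill(slots, Float.NEGATIVE_INFINITY);
    }

    /** Inserts score if it beats the current minimum; returns the evicted value. */
    public float insertWithOverflow(float score) {
        int min = 0;
        for (int i = 1; i < slots.length; i++) {
            if (slots[i] < slots[min]) min = i;
        }
        if (score <= slots[min]) return score;  // not competitive
        float evicted = slots[min];
        slots[min] = score;
        return evicted;
    }

    /** Current worst (smallest) retained score. */
    public float min() {
        float m = slots[0];
        for (float f : slots) if (f < m) m = f;
        return m;
    }
}
```

Because every slot holds a comparable sentinel from the start, the collector's hot loop never branches on "is the queue full yet".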

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: LUCENE-1593.patch, PerfTest.java


 This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
 to remove unnecessary checks. The plan is:
 # Ensure that IndexSearcher returns segments in increasing doc Id order, 
 instead of numDocs().
 # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
 will always have larger ids and therefore cannot compete.
 # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
 and remove the check if reusableSD == null.
 # Also move to use changing top and then call adjustTop(), in case we 
 update the queue.
 # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker 
 for the last SortField. But, doing so should not be necessary (since we 
 already break ties by docID), and is in fact less efficient (once the above 
 optimization is in).
 # Investigate PQ - can we deprecate insert() and have only 
 insertWithOverflow()? Add an addDummyObjects method which will populate the 
 queue without arranging it, just storing the objects in the array (this can 
 be used to pre-populate sentinel values)?
 I will post a patch as well as some perf measurements as soon as I have them.




[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-27 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703363#action_12703363
 ] 

Yonik Seeley commented on LUCENE-1618:
--

I can see how this would potentially be useful for realtime... but it seems 
like only IndexWriter could eventually fix the situation of having the docstore 
on disk and the rest of a segment in RAM.  Which means that this API shouldn't 
be public?

 Allow setting the IndexWriter docstore to be a different directory
 --

 Key: LUCENE-1618
 URL: https://issues.apache.org/jira/browse/LUCENE-1618
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

   Original Estimate: 336h
  Remaining Estimate: 336h

 Add an IndexWriter.setDocStoreDirectory method that allows doc
 stores to be placed in a different directory than the IW default
 dir.




[jira] Commented: (LUCENE-1617) Add testpackage to common-build.xml

2009-04-27 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703365#action_12703365
 ] 

Shai Erera commented on LUCENE-1617:


As reference, I ran test-core and test-contrib and these are the problematic 
classes (from core only):
* Test org.apache.lucene.AnalysisTest
* Test org.apache.lucene.IndexTest
* Test org.apache.lucene.SearchTest
* Test org.apache.lucene.StoreTest
* Test org.apache.lucene.ThreadSafetyTest





[jira] Commented: (LUCENE-1313) Realtime Search

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703366#action_12703366
 ] 

Michael McCandless commented on LUCENE-1313:


{quote}
Agreed that DW can write the segment to the RAMDir. I started
coding along these lines however what do we do about the RAMDir
merging? This is why I was thinking we'll need a separate IW?
Otherwise the ram segments (if they are treated the same as disk
segments) would quickly be merged to disk? Or we have two
separate merging paths?
{quote}

Hmm, right.  We could exclude RAMDir segments from consideration by
MergePolicy?  Alternatively, we could expect the MergePolicy to
recognize this and be smart about choosing merges (ie don't mix
merges)?

EG we do in fact want some merging of the RAM segments if they get too
numerous (since that will impact search performance).

{quote}
 we should make with NRT is to not close the doc store
 (stored fields, term vector) files when flushing for an NRT
 reader.

Agreed, I think this feature is a must otherwise we're doing
unnecessary in ram merging.
{quote}

OK, let's do this as a separate issue/optimization for NRT.  There are
two separate parts to it:

  * Ability to store doc stores in a real directory (looks like you
opened LUCENE-1618 for this part).
 
  * Ability to share IndexOutput and IndexInput

{quote}
I ran into problems with this before, I was trying to reuse
Directory to write a transaction log. It seemed theoretically
doable however it didn't work in practice. It could have been
the seeking and replacing but I don't remember. FSIndexOutput
uses a writeable RAF and FSIndexInput is read only why would
there be an issue?
{quote}

Hmm... seems like we need to investigate further.  We could either
ask an IndexOutput for its IndexInput (sharing the underlying RAF),
or try to separately open an IndexInput (which may not work on
Windows).

{quote}
To implement this functionality in parallel (and perhaps make
the overall patch cleaner), writing doc stores directly to a
separate directory can be a different patch? There can be an
option IW.setDocStoresDirectory(Directory) that the patch
implements? Then some unit tests that are separate from the near
realtime portion.
{quote}

Yes, separate issue (LUCENE-1618).
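The exclude-RAM-segments option above could look roughly like this (a hypothetical, heavily simplified model, not Lucene's MergePolicy API):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplified model: only segments living on disk are merge
// candidates, so RAM-resident NRT segments are left alone by the policy.
public class RamAwareMergeSketch {
    public static class SegmentInfo {
        public final String name;
        public final boolean inRam;
        public SegmentInfo(String name, boolean inRam) {
            this.name = name;
            this.inRam = inRam;
        }
    }

    /** Returns the segments eligible for an on-disk merge. */
    public static List<SegmentInfo> mergeCandidates(List<SegmentInfo> segments) {
        List<SegmentInfo> out = new ArrayList<>();
        for (SegmentInfo si : segments) {
            if (!si.inRam) out.add(si);  // never mix RAM and disk segments
        }
        return out;
    }
}
```

A real policy would additionally merge RAM segments among themselves once they get too numerous, as noted above.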


 Realtime Search
 ---

 Key: LUCENE-1313
 URL: https://issues.apache.org/jira/browse/LUCENE-1313
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, 
 lucene-1313.patch, lucene-1313.patch


 Realtime search with transactional semantics.  
 Possible future directions:
   * Optimistic concurrency
   * Replication
 Encoding each transaction into a set of bytes by writing to a RAMDirectory 
 enables replication.  It is difficult to replicate using other methods 
 because while the document may easily be serialized, the analyzer cannot.
 I think this issue can hold realtime benchmarks which include indexing and 
 searching concurrently.




[jira] Assigned: (LUCENE-1597) New Document and Field API

2009-04-27 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch reassigned LUCENE-1597:
-

Assignee: Michael Busch

 New Document and Field API
 --

 Key: LUCENE-1597
 URL: https://issues.apache.org/jira/browse/LUCENE-1597
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Attachments: lucene-new-doc-api.patch


 This is a super-rough prototype of what a new document API could look like. 
 It's basically what I came up with during a long flight across the Atlantic :)
 It is not integrated with anything yet (like IndexWriter, DocumentsWriter, 
 etc.) and heavily uses Java 1.5 features, such as generics and annotations.
 The general idea sounds similar to what Marvin is doing in KS, which I found 
 out by reading Mike's comments on LUCENE-831, I haven't looked at the KS API 
 myself yet. 
 Main ideas:
 - separate a field's value from its configuration; therefore this patch 
 introduces two classes: FieldDescriptor and FieldValue
 - I was thinking that in most cases the documents people add to a Lucene 
 index look alike, i.e. they contain mostly the same fields with the same 
 settings. Yet, for every field instance the DocumentsWriter checks the 
 settings and calls the right consumers, which themselves check the settings and 
 return true or false, indicating whether or not they want to do something 
 with that field. So I was thinking we could design the document API 
 similar to the Class-Object concept of OO-languages. There a class is a 
 blueprint (as everyone knows :) ), and an object is one instance of it. So in 
 this patch I introduced a class called DocumentDescriptor, which contains all 
 FieldDescriptors with the field settings. This descriptor is given to the 
 consumer (IndexWriter) once in the constructor. Then the Document instances 
 are created and added via addDocument().
 - A Document instance allows adding variable fields in addition to the 
 fixed fields the DocumentDescriptor contains. For these fields the 
 consumers have to check the field settings for every document instance (like 
 with the old document API). This is for maintaining Lucene's flexibility that 
 everyone loves.
 - Disregard the changes to AttributeSource for now. The code that's worth 
 looking at is contained in a new package newdoc.
 Again, this is not a real patch, but rather a demo of how a new API could 
 roughly work.




[jira] Commented: (LUCENE-1597) New Document and Field API

2009-04-27 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703367#action_12703367
 ] 

Michael Busch commented on LUCENE-1597:
---

Thanks for the thorough review, Mike. Reading your response made me really 
excited, because you exactly understood most of the thoughts I put into this 
code, without me even mentioning them :) Thanks for writing them down!

I started including your suggestions into my patch and will reply with more 
detail to your individual points as I'm working on them.





[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-27 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703368#action_12703368
 ] 

Earwin Burrfoot commented on LUCENE-1593:
-

Use FMPP? It is pretty nice and integrates well into maven/ant builds. I'm 
using it for primitive-specialized fieldcaches.





[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703370#action_12703370
 ] 

Michael McCandless commented on LUCENE-1618:


Yeah, I also think this should be an under-the-hood optimization (done only by 
NRT) inside IndexWriter.

The only possible non-NRT case I can think of is when users make temporary 
indices in RAM, it's possible one would want to write the docStore files to an 
FSDirectory (because they are so large) but keep postings, norms, deletes, etc 
in RAM.  But going down that road opens up a can of worms... eg does segments_N 
somehow have to keep track of which dir has which parts of a segment?  Suddenly 
IndexReader must also know to look in different dirs for different parts of a 
segment, etc.

It might be cleaner to make a Directory impl that dispatches certain files to a 
RAMDir and others to an FSDir, so IndexWriter/IndexReader still see a single 
Directory API.
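The dispatching idea might be sketched like this (plain Java stand-ins, not Lucene's Directory API; the file suffixes chosen are illustrative):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Sketch: route doc-store files to one backing store (e.g. disk) and all
// other segment files to another (e.g. RAM), behind a single facade.
public class DispatchingStoreSketch {
    public interface Store { Map<String, byte[]> files(); }

    public static class MapStore implements Store {
        private final Map<String, byte[]> files = new HashMap<>();
        public Map<String, byte[]> files() { return files; }
    }

    private final Store primary;                  // e.g. RAM-resident store
    private final Store secondary;                // e.g. on-disk store
    private final Predicate<String> useSecondary; // which file names go to disk

    public DispatchingStoreSketch(Store primary, Store secondary,
                                  Predicate<String> useSecondary) {
        this.primary = primary;
        this.secondary = secondary;
        this.useSecondary = useSecondary;
    }

    public void write(String name, byte[] data) {
        (useSecondary.test(name) ? secondary : primary).files().put(name, data);
    }

    public byte[] read(String name) {
        return (useSecondary.test(name) ? secondary : primary).files().get(name);
    }
}
```

Callers see one store; only the predicate knows about the split, so nothing like segments_N would need to track which directory holds which parts of a segment.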





[jira] Updated: (LUCENE-1617) Add testpackage to common-build.xml

2009-04-27 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1617:
---

Attachment: LUCENE-1617.patch

This one removes the aforementioned test classes (that are not really tests), 
in case everybody agrees.

 Add testpackage to common-build.xml
 -

 Key: LUCENE-1617
 URL: https://issues.apache.org/jira/browse/LUCENE-1617
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Build
Reporter: Shai Erera
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1617.patch, LUCENE-1617.patch, LUCENE-1617.patch


 One can define testcase to execute just one test class, which is 
 convenient. However, I didn't notice any equivalent for testing a whole 
 package. I find it convenient to be able to test packages rather than test 
 cases because often it is not so clear which test class to run.
 Following patch allows one to ant test -Dtestpackage=search (for example) 
 and run all tests under the \*/search/\* packages in core, contrib and tags, 
 or do ant test-core -Dtestpackage=search and execute similarly just for 
 core, or do ant test-core -Dtestpackage=lucene/search/function and run all 
 the tests under \*/lucene/search/function/\* (just in case there is another 
 o.a.l.something.search.function package out there which we want to exclude).




[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-27 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703373#action_12703373
 ] 

Shai Erera commented on LUCENE-1593:


Forgive my ignorance, but what is FMPP? And to which of the above is it 
related? :)

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: LUCENE-1593.patch, PerfTest.java


 This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
 to remove unnecessary checks. The plan is:
 # Ensure that IndexSearcher returns segments in increasing doc Id order, 
 instead of numDocs().
 # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
 will always have larger ids and therefore cannot compete.
 # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
 and remove the check if reusableSD == null.
 # Also move to use changing top and then call adjustTop(), in case we 
 update the queue.
 # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker 
 for the last SortField. But, doing so should not be necessary (since we 
 already break ties by docID), and is in fact less efficient (once the above 
 optimization is in).
 # Investigate PQ - can we deprecate insert() and have only 
 insertWithOverflow()? Add an addDummyObjects method which will populate the 
 queue without arranging it, just store the objects in the array (this can 
 be used to pre-populate sentinel values)?
 I will post a patch as well as some perf measurements as soon as I have them.




[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703374#action_12703374
 ] 

Michael McCandless commented on LUCENE-1616:


OK all tests pass.  I had to fix a few back-compat tests (that were using the 
new TokenStream API, I think because we created the back-compat branch from 
trunk after the new TokenStream API landed).

I'll commit in a day or two.  Thanks Eks!
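The combined setter is easy to picture with a plain stand-in class (names mirror OffsetAttribute, but this is an illustrative sketch, not the actual Lucene implementation):

```java
// Illustrative stand-in for Lucene's OffsetAttribute: one combined setter
// replaces the two separate start/end setters this issue proposes removing.
public class Main {
    static final class OffsetAttr {
        private int start, end;

        // combined setter proposed in LUCENE-1616
        void setOffset(int startOffset, int endOffset) {
            this.start = startOffset;
            this.end = endOffset;
        }

        int startOffset() { return start; }
        int endOffset() { return end; }
    }

    public static void main(String[] args) {
        OffsetAttr att = new OffsetAttr();
        att.setOffset(3, 8); // e.g. a token covering chars [3, 8) of the input
        System.out.println(att.startOffset() + "," + att.endOffset());
    }
}
```

A tokenizer then sets both offsets in a single call per token, which is the CharTokenizer change the patch makes.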

 add one setter for start and end offset to OffsetAttribute
 --

 Key: LUCENE-1616
 URL: https://issues.apache.org/jira/browse/LUCENE-1616
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Eks Dev
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch, 
 LUCENE-1616.patch


 add OffsetAttribute. setOffset(startOffset, endOffset);
 trivial change, no JUnit needed
 Changed CharTokenizer to use it




[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-27 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703375#action_12703375
 ] 

Jason Rutherglen commented on LUCENE-1618:
--

{quote}
non-NRT case I can think of is when users make temporary indices in RAM
{quote}

Yes, and there could be others we don't know about.  

{quote}
it might be cleaner to make a Directory impl that dispatches certain files to a 
RAMDir and others to an FSDir
{quote}

Good idea.  I'll try that method first.  If this one works out, then the API 
will be public?

 Allow setting the IndexWriter docstore to be a different directory
 --

 Key: LUCENE-1618
 URL: https://issues.apache.org/jira/browse/LUCENE-1618
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

   Original Estimate: 336h
  Remaining Estimate: 336h

 Add an IndexWriter.setDocStoreDirectory method that allows doc
 stores to be placed in a different directory than the IW default
 dir.




[jira] Commented: (LUCENE-1313) Realtime Search

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703377#action_12703377
 ] 

Michael McCandless commented on LUCENE-1313:


So let's leave this issue focused on sometimes using RAMDir for newly created 
segments.

 Realtime Search
 ---

 Key: LUCENE-1313
 URL: https://issues.apache.org/jira/browse/LUCENE-1313
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, 
 lucene-1313.patch, lucene-1313.patch


 Realtime search with transactional semantics.  
 Possible future directions:
   * Optimistic concurrency
   * Replication
 Encoding each transaction into a set of bytes by writing to a RAMDirectory 
 enables replication.  It is difficult to replicate using other methods 
 because while the document may easily be serialized, the analyzer cannot.
 I think this issue can hold realtime benchmarks which include indexing and 
 searching concurrently.




[jira] Assigned: (LUCENE-1617) Add testpackage to common-build.xml

2009-04-27 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1617:
--

Assignee: Michael McCandless

 Add testpackage to common-build.xml
 -

 Key: LUCENE-1617
 URL: https://issues.apache.org/jira/browse/LUCENE-1617
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Build
Reporter: Shai Erera
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1617.patch, LUCENE-1617.patch, LUCENE-1617.patch


 One can define testcase to execute just one test class, which is 
 convenient. However, I didn't notice any equivalent for testing a whole 
 package. I find it convenient to be able to test packages rather than test 
 cases because often it is not so clear which test class to run.
 Following patch allows one to ant test -Dtestpackage=search (for example) 
 and run all tests under the \*/search/\* packages in core, contrib and tags, 
 or do ant test-core -Dtestpackage=search and execute similarly just for 
 core, or do ant test-core -Dtestpackage=lucene/search/function and run all 
 the tests under \*/lucene/search/function/\* (just in case there is another 
 o.a.l.something.search.function package out there which we want to exclude).




[jira] Commented: (LUCENE-1617) Add testpackage to common-build.xml

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703381#action_12703381
 ] 

Michael McCandless commented on LUCENE-1617:


OK looks good.  I'll wait a day or two before committing.  Thanks Shai!

 Add testpackage to common-build.xml
 -

 Key: LUCENE-1617
 URL: https://issues.apache.org/jira/browse/LUCENE-1617
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Build
Reporter: Shai Erera
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1617.patch, LUCENE-1617.patch, LUCENE-1617.patch


 One can define testcase to execute just one test class, which is 
 convenient. However, I didn't notice any equivalent for testing a whole 
 package. I find it convenient to be able to test packages rather than test 
 cases because often it is not so clear which test class to run.
 Following patch allows one to ant test -Dtestpackage=search (for example) 
 and run all tests under the \*/search/\* packages in core, contrib and tags, 
 or do ant test-core -Dtestpackage=search and execute similarly just for 
 core, or do ant test-core -Dtestpackage=lucene/search/function and run all 
 the tests under \*/lucene/search/function/\* (just in case there is another 
 o.a.l.something.search.function package out there which we want to exclude).




[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-27 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703382#action_12703382
 ] 

Earwin Burrfoot commented on LUCENE-1593:
-

bq. Forgive my ignorance, but what is FMPP?
Forgive my laziness, http://fmpp.sourceforge.net/ - What is FMPP? FMPP is a 
general-purpose text file preprocessor tool that uses FreeMarker templates.

bq. And to which of the above is it related?
to this
bq. It's just that maintaining that class becomes more and more problematic. It 
already contains 6 inner classes, which duplicate the code to avoid 'if' 
statements. Meaning every time a bug is found, all 6 need to be checked and 
fixed. With that proposal, it means 12 ...

Mike experimented with generated code for specialized search; I see no reason 
not to use the same approach for cases where you're already hand-coding N 
almost-identical classes. You're generating the query parser, after all :)
For an official release, FMPP is superior to Python, as it can be bundled in a 
cross-platform manner.

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: LUCENE-1593.patch, PerfTest.java


 This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
 to remove unnecessary checks. The plan is:
 # Ensure that IndexSearcher returns segments in increasing doc Id order, 
 instead of numDocs().
 # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
 will always have larger ids and therefore cannot compete.
 # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
 and remove the check if reusableSD == null.
 # Also move to use changing top and then call adjustTop(), in case we 
 update the queue.
 # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker 
 for the last SortField. But, doing so should not be necessary (since we 
 already break ties by docID), and is in fact less efficient (once the above 
 optimization is in).
 # Investigate PQ - can we deprecate insert() and have only 
 insertWithOverflow()? Add an addDummyObjects method which will populate the 
 queue without arranging it, just store the objects in the array (this can 
 be used to pre-populate sentinel values)?
 I will post a patch as well as some perf measurements as soon as I have them.




[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-27 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703384#action_12703384
 ] 

Tim Smith commented on LUCENE-1618:
---

I would further suggest that this Directory implementation take one or 
more directories to store documents, along with one or more directories to 
store the index itself.

One of the directories should be explicitly marked for reading for each use.

This allows creating a Directory instance that will:
* store documents to disk (reading from disk during searches)
* write index to disk and ram (reading from RAM during searches)

 Allow setting the IndexWriter docstore to be a different directory
 --

 Key: LUCENE-1618
 URL: https://issues.apache.org/jira/browse/LUCENE-1618
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

   Original Estimate: 336h
  Remaining Estimate: 336h

 Add an IndexWriter.setDocStoreDirectory method that allows doc
 stores to be placed in a different directory than the IW default
 dir.




[jira] Commented: (LUCENE-1313) Realtime Search

2009-04-27 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703387#action_12703387
 ] 

Jason Rutherglen commented on LUCENE-1313:
--

{quote} We could exclude RAMDir segments from consideration by
MergePolicy? Alternatively, we could expect the MergePolicy to
recognize this and be smart about choosing merges (ie don't mix
merges)? {quote}

Is this overcomplicating things? Sometimes we want a mixture of
RAMDir segments and FSDir segments to merge (when we've decided
we have too much in ram), sometimes we don't (when we just want
the ram segments to merge). I'm still a little confused as to
why having a wrapper class that manages a disk writer and a ram
writer isn't cleaner?  

 Realtime Search
 ---

 Key: LUCENE-1313
 URL: https://issues.apache.org/jira/browse/LUCENE-1313
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, 
 lucene-1313.patch, lucene-1313.patch


 Realtime search with transactional semantics.  
 Possible future directions:
   * Optimistic concurrency
   * Replication
 Encoding each transaction into a set of bytes by writing to a RAMDirectory 
 enables replication.  It is difficult to replicate using other methods 
 because while the document may easily be serialized, the analyzer cannot.
 I think this issue can hold realtime benchmarks which include indexing and 
 searching concurrently.




[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703388#action_12703388
 ] 

Michael McCandless commented on LUCENE-1618:


{quote}
 it might be cleaner to make a Directory impl that dispatches certain files to 
 a RAMDir and others to an FSDir

Good idea. I'll try that method first. If this one works out, then the API will 
be public?
{quote}

Which API would be public?

If this (call it FileSwitchDirectory for now ;) ) works then we would not add 
any API to IndexWriter (ie it's either or)?  But FileSwitchDirectory would be 
public & expert.

One downside to this approach is that it's brittle -- whenever we change file 
extensions you'd have to know to fix this Directory.  Or maybe we make the 
Directory specialized to only storing the doc stores in the FSDir, then 
whenever we change file formats we would fix this directory?  But in the 
future, with custom codecs, things could be named whatever... hmmm.  Lacking 
clarity.
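The dispatch-by-extension idea can be sketched generically. Everything below (class and method names, the two-store abstraction) is hypothetical and not the eventual FileSwitchDirectory API; the extension list in main() assumes the 2.9-era stored-fields files (.fdt/.fdx):

```java
// Sketch of the proposed "FileSwitchDirectory": route each file to one of two
// backing stores based on its extension. All names here are hypothetical.
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Main {
    interface Store { String name(); }

    static final class SwitchDir {
        private final Store primary, secondary;
        private final Set<String> secondaryExts;

        SwitchDir(Store primary, Store secondary, Set<String> secondaryExts) {
            this.primary = primary;
            this.secondary = secondary;
            this.secondaryExts = secondaryExts;
        }

        // Pick the store for a file. This extension list is the brittle part
        // noted above: it must track Lucene's file-format changes.
        Store pick(String fileName) {
            int dot = fileName.lastIndexOf('.');
            String ext = dot < 0 ? "" : fileName.substring(dot + 1);
            return secondaryExts.contains(ext) ? secondary : primary;
        }
    }

    public static void main(String[] args) {
        Store ram = () -> "ram", fs = () -> "fs";
        // fdt/fdx hold the stored fields (doc store) in 2.9-era Lucene
        SwitchDir dir = new SwitchDir(ram, fs,
                new HashSet<>(Arrays.asList("fdt", "fdx")));
        System.out.println(dir.pick("_0.fdt").name()); // doc store -> fs
        System.out.println(dir.pick("_0.tis").name()); // postings -> ram
    }
}
```

Because IndexWriter and IndexReader would see only a single Directory, neither needs any new API, which is the appeal of this approach.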

 Allow setting the IndexWriter docstore to be a different directory
 --

 Key: LUCENE-1618
 URL: https://issues.apache.org/jira/browse/LUCENE-1618
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

   Original Estimate: 336h
  Remaining Estimate: 336h

 Add an IndexWriter.setDocStoreDirectory method that allows doc
 stores to be placed in a different directory than the IW default
 dir.




[jira] Commented: (LUCENE-1597) New Document and Field API

2009-04-27 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703391#action_12703391
 ] 

Michael Busch commented on LUCENE-1597:
---

{quote}
How would you turn on/off [future] CSF storage? A separate attr? A
boolean on StoreAttribute?
{quote}

I was thinking about adding a separate attribute. But here is one
thing I haven't figured out yet: it should actually be perfectly fine
to store a value in a CSF and *also* in the 'normal' store. The
problem is that the type of data input is the limiting factor here: if
the user provides the data as a byte array, then everything works
fine. However, if the data is provided as a Reader, then it's not 
guaranteed that the reader can be read more than once. Implementing 
reset() is optional, as the javadocs say.

So maybe we should state in our javadocs that a reader must support
reset(); otherwise, writing the data into more than one data structure 
will result in undefined behavior? Alternatively, we could introduce 
a new class called ResetableReader, where reset() is abstract, and
change the API in 3.0 to only accept that type of reader?

Btw, the same is true for fields that provide the data as a TokenStream.
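A quick check of the Reader contract illustrates the problem: java.io.StringReader happens to support reset() back to the beginning of the string, but the base java.io.Reader makes no such guarantee, so a second pass over an arbitrary Reader may simply fail:

```java
// Demonstrates the optional Reader.reset() contract: reading the same data
// twice only works if the concrete Reader supports reset().
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class Main {
    static String readAll(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) sb.append((char) c);
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Reader r = new StringReader("field value");
        String first = readAll(r);
        r.reset(); // StringReader resets to the start; many Readers throw here
        String second = readAll(r);
        System.out.println(first.equals(second)); // both passes see the same data
    }
}
```

Storing the same Reader-backed value in both a CSF and the normal store would need exactly this kind of second pass, hence the suggestion to require reset() support in the javadocs.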

 New Document and Field API
 --

 Key: LUCENE-1597
 URL: https://issues.apache.org/jira/browse/LUCENE-1597
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Attachments: lucene-new-doc-api.patch


 This is a super rough prototype of how a new document API could look like. 
 It's basically what I came up with during a long flight across the Atlantic :)
 It is not integrated with anything yet (like IndexWriter, DocumentsWriter, 
 etc.) and heavily uses Java 1.5 features, such as generics and annotations.
 The general idea sounds similar to what Marvin is doing in KS, which I found 
 out by reading Mike's comments on LUCENE-831, I haven't looked at the KS API 
 myself yet. 
 Main ideas:
 - separate a field's value from its configuration; therefore this patch 
 introduces two classes: FieldDescriptor and FieldValue
 - I was thinking that in most cases the documents people add to a Lucene 
 index look alike, i.e. they contain mostly the same fields with the same 
 settings. Yet, for every field instance the DocumentsWriter checks the 
 settings and calls the right consumers, which themselves check settings and 
 return true or false, indicating whether or not they want to do something 
 with that field or not. So I was thinking we could design the document API 
 similar to the Class-Object concept of OO-languages. There a class is a 
 blueprint (as everyone knows :) ), and an object is one instance of it. So in 
 this patch I introduced a class called DocumentDescriptor, which contains all 
 FieldDescriptors with the field settings. This descriptor is given to the 
 consumer (IndexWriter) once in the constructor. Then the Document instances 
 are created and added via addDocument().
 - A Document instance allows adding variable fields in addition to the 
 fixed fields the DocumentDescriptor contains. For these fields the 
 consumers have to check the field settings for every document instance (like 
 with the old document API). This is for maintaining Lucene's flexibility that 
 everyone loves.
 - Disregard the changes to AttributeSource for now. The code that's worth 
 looking at is contained in a new package newdoc.
 Again, this is not a real patch, but rather a demo of how a new API could 
 roughly work.




[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-27 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703396#action_12703396
 ] 

Shai Erera commented on LUCENE-1593:


bq. EG BooleanWeight.scorer(...) should call BS2's initCountingSumScorer (and/or 
somehow forward to BS)?

OK, that "somehow forward to BS" is more problematic than I initially 
thought. BS2.score(Collector) determines whether to instantiate a new BS, add 
the scorers and call bs.score(Collector), or to execute the scoring itself. On 
the other hand, it uses the same scorers in next() and skipTo(). Therefore 
there's a kind of mutual exclusiveness here: either the scorers are used by BS 
or by BS2. They cannot be used by both, unless we clone() them. If we want to 
clone them, we need to:
* Create a BS in init().
* Clone all the Scorers and pass them to BS.
* Initialize BS2's countingSumScorer.
* In score(Collector) use the class member of BS.

bq. hmm I wonder why this wasn't done so far?

I think I understand now ... the decision on which path to take can only be 
determined after score(Collector) is called, or next()/skipTo(). Before that, 
i.e., when BW returns BS2 it does not know how it will be used, right? The 
decision is made by IndexSearcher.doSearch depending on whether there's a 
filter (next()/skipTo() are used) or not (score(Collector)).

So perhaps we should revert to having start() on DISI? IndexSearcher 
can call start() before iterating over the docs, but not if it uses 
scorer.score(Collector), which is delegated to the scorer. In that case, we 
should check whether the countingSumScorer was initialized and, if not, 
initialize it ourselves.

Am I missing something?

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: LUCENE-1593.patch, PerfTest.java


 This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
 to remove unnecessary checks. The plan is:
 # Ensure that IndexSearcher returns segments in increasing doc Id order, 
 instead of numDocs().
 # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
 will always have larger ids and therefore cannot compete.
 # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
 and remove the check if reusableSD == null.
 # Also move to use changing top and then call adjustTop(), in case we 
 update the queue.
 # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker 
 for the last SortField. But, doing so should not be necessary (since we 
 already break ties by docID), and is in fact less efficient (once the above 
 optimization is in).
 # Investigate PQ - can we deprecate insert() and have only 
 insertWithOverflow()? Add an addDummyObjects method which will populate the 
 queue without arranging it, just store the objects in the array (this can 
 be used to pre-populate sentinel values)?
 I will post a patch as well as some perf measurements as soon as I have them.




[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-27 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703406#action_12703406
 ] 

Eks Dev commented on LUCENE-1618:
-

Maybe FileSwitchDirectory should have the possibility to take a list of 
files/extensions that should be loaded into RAM, making it maintenance-free by 
pushing this decision to the end user. If and when we decide to support users 
in it, we could then maintain a static list in a separate place. A kind of 
separation of execution and configuration.

I *think* I saw something similar that Ning Lee made quite a while ago, from the 
Hadoop camp (indexing on Hadoop, something...). But I cannot remember what it 
was :(

 Allow setting the IndexWriter docstore to be a different directory
 --

 Key: LUCENE-1618
 URL: https://issues.apache.org/jira/browse/LUCENE-1618
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

   Original Estimate: 336h
  Remaining Estimate: 336h

 Add an IndexWriter.setDocStoreDirectory method that allows doc
 stores to be placed in a different directory than the IW default
 dir.




[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703404#action_12703404
 ] 

Michael McCandless commented on LUCENE-1593:


{quote}
Some extensions of PQ may not know how to construct such a sentinel object. 
Consider ComparablePQ, which assumes all items are Comparable. Unlike HitQueue, 
it does not know what will be the items stored in the queue. But ... I guess it 
can return a Comparable which always prefers the other element ... so maybe 
that's not a good example.
I just have the feeling that a setter method will give us more freedom, rather 
than having to extend PQ just for that ...
{quote}

Such extensions shouldn't use a sentinel?

Various things spook me about the separate method: one could easily
pass in bad sentinels, and then the queue is in an invalid state; the
method can be called at any time (whereas the only time you should do
this is on init); you could pass in a wrong-sized array; the API is
necessarily public (whereas with getSentinel() it'd be protected).

We can mull it over some more... sleep on it ;)

bq. Right ... it's too late for me

I've been starting to wonder if you are a robot...

bq. hmm I wonder why this wasn't done so far?

I don't know!  Seems like a simple optimization.  So we don't need
start/end (now at least).

bq. It's just that maintaining that class becomes more and more problematic.

I completely agree: this is the tradeoff we have to mull.  But I like
that all these classes are private (it hides the fact that there are
12 concrete impls).

I think I'd lean towards the 12 impls now.  They are tiny classes.

bq. But I wonder from where would we control it

Hmm.. yeah good point.  The only known place in Lucene's core that
visits hits out of order is BooleanScorer.  But presumably an external
Query somewhere may provide a Scorer that does things out of order
(maybe Solr does?), and so technically making the core collectors not
break ties by docID by default is a break in back-compat.

Maybe we should add a docsInOrder() method to Scorer?  By default it
returns false, but we fix that to return true for all core Lucene
queries?  And then IndexSearcher consults that to decide whether it
can do this?
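The sentinel trick under discussion can be sketched with a plain min-heap; this is an illustration of the idea under stated assumptions, not Lucene's actual PriorityQueue/HitQueue code:

```java
// Sketch of sentinel pre-population: fill a bounded queue with
// Float.NEGATIVE_INFINITY up front, so the collect loop never needs a
// "queue not yet full" branch -- insertion is always compare-and-replace.
import java.util.PriorityQueue;

public class Main {
    static final class HitQueue {
        // min-heap: the worst retained hit is always on top
        private final PriorityQueue<Float> pq = new PriorityQueue<>();

        HitQueue(int size) {
            for (int i = 0; i < size; i++) {
                pq.add(Float.NEGATIVE_INFINITY); // sentinel: any real score beats it
            }
        }

        // Queue is always "full", so no null/size checks in the hot loop.
        void insertWithOverflow(float score) {
            if (score > pq.peek()) {
                pq.poll();  // drop the current worst (possibly a sentinel)
                pq.add(score);
            }
        }

        float worstScore() { return pq.peek(); }
    }

    public static void main(String[] args) {
        HitQueue q = new HitQueue(2); // keep the top 2 hits
        q.insertWithOverflow(1.5f);
        q.insertWithOverflow(0.5f);
        q.insertWithOverflow(2.0f);
        System.out.println(q.worstScore()); // worst of the kept hits {1.5, 2.0}
    }
}
```

A getSentinel()-style protected hook, as suggested above, would let subclasses supply the sentinel at construction time instead of exposing a public setter that could leave the queue in an invalid state.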


 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: LUCENE-1593.patch, PerfTest.java


 This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
 to remove unnecessary checks. The plan is:
 # Ensure that IndexSearcher returns segments in increasing doc Id order, 
 instead of numDocs().
 # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
 will always have larger ids and therefore cannot compete.
 # Pre-populate HitQueue with sentinel values in TSDC (score = 
 Float.NEGATIVE_INFINITY) and remove the check if reusableSD == null.
 # Also move to use changing top and then call adjustTop(), in case we 
 update the queue.
 # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker 
 for the last SortField. But, doing so should not be necessary (since we 
 already break ties by docID), and is in fact less efficient (once the above 
 optimization is in).
 # Investigate PQ - can we deprecate insert() and have only 
 insertWithOverflow()? Add a addDummyObjects method which will populate the 
 queue without arranging it, just store the objects in the array (this can 
 be used to pre-populate sentinel values)?
 I will post a patch as well as some perf measurements as soon as I have them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1597) New Document and Field API

2009-04-27 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703407#action_12703407
 ] 

Michael Busch commented on LUCENE-1597:
---

{quote}
Can we maybe rename Descriptor -> Type? Eg FieldDescriptor ->
FieldType?
{quote}

Done.

{quote}
Can a single FieldDescriptor be shared among many fields? Seems like
we'd have to take name out of FieldDescriptor (I don't think the name
should be in FieldDescriptor, anyway).
{quote}

I agree, this should be possible. I removed the name.

{quote}
NumericFieldAttribute seems awkward (one shouldn't have to turn on/off
zero padding, trie; or rather it's better to operate in use cases
like I want to do range filtering or I want to sort). Seems like
maybe we need a SortAttribute and RangeFilterAttribute
(or... something).
{quote}

Yep I agree. Some things in this prototype are quite goofy, because I 
wanted to mainly demonstrate the main ideas. The attributes you suggest
make sense to me.
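Roughly what use-case-oriented attributes might look like; all names here (SortAttribute, RangeFilterAttribute, FieldType) are assumptions for illustration, not the patch's API:

```java
import java.util.HashMap;
import java.util.Map;

// The field type carries the user's intent ("I want to sort", "I want range
// filtering"); the indexing machinery would derive low-level settings
// (zero padding, trie encoding) from that, instead of exposing them directly.
interface FieldAttribute {}

final class SortAttribute implements FieldAttribute {}

final class RangeFilterAttribute implements FieldAttribute {
    final int precisionStep;  // illustrative knob a range-filter attribute might carry
    RangeFilterAttribute(int precisionStep) { this.precisionStep = precisionStep; }
}

// Shareable across many fields: note there is no field name in here.
class FieldType {
    private final Map<Class<?>, FieldAttribute> attrs = new HashMap<>();
    FieldType add(FieldAttribute a) { attrs.put(a.getClass(), a); return this; }
    boolean has(Class<? extends FieldAttribute> c) { return attrs.containsKey(c); }
    <T extends FieldAttribute> T get(Class<T> c) { return c.cast(attrs.get(c)); }
}
```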


 New Document and Field API
 --

 Key: LUCENE-1597
 URL: https://issues.apache.org/jira/browse/LUCENE-1597
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Attachments: lucene-new-doc-api.patch


 This is a super rough prototype of how a new document API could look like. 
 It's basically what I came up with during a long flight across the Atlantic :)
 It is not integrated with anything yet (like IndexWriter, DocumentsWriter, 
 etc.) and heavily uses Java 1.5 features, such as generics and annotations.
 The general idea sounds similar to what Marvin is doing in KS, which I found 
 out by reading Mike's comments on LUCENE-831, I haven't looked at the KS API 
 myself yet. 
 Main ideas:
 - separate a field's value from its configuration; therefore this patch 
 introduces two classes: FieldDescriptor and FieldValue
 - I was thinking that in most cases the documents people add to a Lucene 
 index look alike, i.e. they contain mostly the same fields with the same 
 settings. Yet, for every field instance the DocumentsWriter checks the 
 settings and calls the right consumers, which themselves check settings and 
 return true or false, indicating whether or not they want to do something 
 with that field or not. So I was thinking we could design the document API 
 similar to the Class-Object concept of OO-languages. There a class is a 
 blueprint (as everyone knows :) ), and an object is one instance of it. So in 
 this patch I introduced a class called DocumentDescriptor, which contains all 
 FieldDescriptors with the field settings. This descriptor is given to the 
 consumer (IndexWriter) once in the constructor. Then the Document instances 
 are created and added via addDocument().
 - A Document instance allows adding variable fields in addition to the 
 fixed fields the DocumentDescriptor contains. For these fields the 
 consumers have to check the field settings for every document instance (like 
 with the old document API). This is for maintaining Lucene's flexibility that 
 everyone loves.
 - Disregard the changes to AttributeSource for now. The code that's worth 
 looking at is contained in a new package newdoc.
 Again, this is not a real patch, but rather a demo of how a new API could 
 roughly work.




[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703410#action_12703410
 ] 

Michael McCandless commented on LUCENE-1593:



bq. Use FMPP?

I think the first question we need to answer is whether we cutover to
specialization for this.  At this point I don't think we need to, yet
(I think the 12 classes is tolerable, since they are tiny and
private).

The second question is, if we do switch to specialization at some
point (which I think we should: the performance gains are sizable),
how should we do the generation (Python, Java, FMPP, XSLT, etc.).  I
think it's a long time before we need to make that decision (many
iterations remain on LUCENE-1594).





[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703412#action_12703412
 ] 

Michael McCandless commented on LUCENE-1593:


bq.  the decision on which path to take can only be determined after 
score(Collector) is called, or next()/skipTo().

Oh I see: the Scorer cannot know on creation if it's the top scorer 
(score(Collector) will be called), or a secondary one (next()/skipTo(...) will 
be called).

Hmm yeah maybe back to DISI.start().  I think as long as the actual code that will 
next()/skipTo(...) through the iterator is the only one that calls start(), the 
BS/BS2 double-start problem won't happen?

Really, somehow, it should be explicit when a Scorer will be topmost.  
IndexSearcher knows this when it creates it.
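One way to make that explicit, sketched with stand-in names (this is not Lucene's actual Scorer API): the creator passes the topmost flag in once, and only the collection-driving path ever calls start():

```java
// The searcher knows at creation time whether this scorer will drive
// collection (score(Collector)) or be advanced by a parent scorer
// (next()/skipTo()), so it decides, not the scorer.
class SketchScorer {
    private final boolean topScorer;  // decided once, by whoever created us
    boolean started = false;

    SketchScorer(boolean topScorer) { this.topScorer = topScorer; }

    /** Drive collection ourselves; legal only for the topmost scorer. */
    void scoreAll() {
        if (!topScorer) throw new IllegalStateException("not the top scorer");
        start();  // safe: only the code that iterates the scorer calls start()
        // ... would iterate next()/skipTo() here and hand docs to the collector ...
    }

    void start() { started = true; }  // one-time init, a la the proposed DISI.start()
}
```

With the flag fixed at construction, the BS/BS2 double-start scenario has no entry point.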




[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703415#action_12703415
 ] 

Michael McCandless commented on LUCENE-1618:


bq. Would also further suggest that this Directory implementation would take 
one or more directories to store documents, along with one or more directories 
to store the index itself

You mean an opened IndexOutput would write its output to two (or more) 
different places?  So you could write through a RAMDir down to an FSDir?  
(This way both the RAMDir and FSDir have a copy of the index).
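A toy model of that write-through idea; Dir here is a simplified stand-in for Lucene's Directory, not the real interface:

```java
import java.util.HashMap;
import java.util.Map;

// Every write is mirrored into a primary (RAM-like) and a secondary
// (disk-like) store, so both hold a copy of the index; reads are served
// from the fast primary.
interface Dir {
    void writeFile(String name, byte[] data);
    byte[] readFile(String name);
}

class MapDir implements Dir {  // stand-in for RAMDirectory or FSDirectory
    private final Map<String, byte[]> files = new HashMap<>();
    public void writeFile(String name, byte[] data) { files.put(name, data); }
    public byte[] readFile(String name) { return files.get(name); }
}

class WriteThroughDir implements Dir {
    private final Dir primary, secondary;
    WriteThroughDir(Dir primary, Dir secondary) {
        this.primary = primary;
        this.secondary = secondary;
    }
    public void writeFile(String name, byte[] data) {
        primary.writeFile(name, data);    // e.g. RAMDir: fast reads
        secondary.writeFile(name, data);  // e.g. FSDir: durable copy
    }
    public byte[] readFile(String name) { return primary.readFile(name); }
}
```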

 Allow setting the IndexWriter docstore to be a different directory
 --

 Key: LUCENE-1618
 URL: https://issues.apache.org/jira/browse/LUCENE-1618
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

   Original Estimate: 336h
  Remaining Estimate: 336h

 Add an IndexWriter.setDocStoreDirectory method that allows doc
 stores to be placed in a different directory than the IW default
 dir.




[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703416#action_12703416
 ] 

Michael McCandless commented on LUCENE-1618:


{quote}
FileSwitchDirectory should have possibility to get file list/extensions that 
should be loaded into RAM... making it maintenance free, pushing this decision 
to end user... if, and when we decide to support users in it, we could then 
maintain static list at separate place. Kind of separate execution and 
configuration
{quote}

+1

With flexible indexing, presumably one could use their codec to ask it for the 
doc store extensions vs the postings extensions, etc., and pass to this 
configurable FileSwitchDirectory.
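The user-configurable routing could be as small as this sketch; the class name and the example extension set are illustrative assumptions, not the real FileSwitchDirectory:

```java
import java.util.Set;

// Instead of a hardwired static list, the caller supplies the set of file
// extensions that should live in the fast (RAM-resident) directory;
// everything else is routed to the other directory.
class ExtensionRouter {
    private final Set<String> fastExtensions;

    ExtensionRouter(Set<String> fastExtensions) {
        this.fastExtensions = fastExtensions;
    }

    /** Returns true if this file should go to the RAM-resident directory. */
    boolean useFastDir(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot >= 0 && fastExtensions.contains(fileName.substring(dot + 1));
    }
}
```

Under flexible indexing, the set passed in would come from asking the codec which extensions are doc stores vs. postings.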




[jira] Commented: (LUCENE-1313) Realtime Search

2009-04-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703419#action_12703419
 ] 

Michael McCandless commented on LUCENE-1313:


{quote}
Sometimes we want a mixture of
RAMDir segments and FSDir segments to merge (when we've decided
we have too much in ram),
{quote}

I don't think we want to mix RAM & disk merging?

EG when RAM is full, we want to quickly flush it to disk as a single
segment.  Merging with disk segments only makes that flush slower?

{quote}
I'm still a little confused as to
why having a wrapper class that manages a disk writer and a ram
writer isn't cleaner?
{quote}

This is functionally the same as not mixing RAM vs disk merging,
right (ie just as clean)?


 Realtime Search
 ---

 Key: LUCENE-1313
 URL: https://issues.apache.org/jira/browse/LUCENE-1313
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, 
 lucene-1313.patch, lucene-1313.patch


 Realtime search with transactional semantics.  
 Possible future directions:
   * Optimistic concurrency
   * Replication
 Encoding each transaction into a set of bytes by writing to a RAMDirectory 
 enables replication.  It is difficult to replicate using other methods 
 because while the document may easily be serialized, the analyzer cannot.
 I think this issue can hold realtime benchmarks which include indexing and 
 searching concurrently.




RangeQuery and getTerm

2009-04-27 Thread Mark Miller
RangeQuery is based on two terms rather than one, and currently returns 
null from getTerm.


This can lead to less than obvious null pointer exceptions. I'd almost 
prefer to throw UnsupportedOperationException.


However, returning null allows you to still use getTerm on 
MultiTermQuery and do a null check in the RangeQuery case. Not sure how 
valuable that really is though.
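The caller-side tradeoff, sketched with simplified stand-in classes (not the real signatures): a null return keeps one code path with an explicit branch, whereas UnsupportedOperationException would force a type check up front.

```java
// Simplified model of MultiTermQuery.getTerm() returning null for queries,
// like a range query, that are based on two terms rather than one.
abstract class MTQuery {
    /** Null when the query has no single defining term. */
    abstract String getTerm();
}

class WildcardQ extends MTQuery {
    String getTerm() { return "foo*"; }
}

class RangeQ extends MTQuery {
    String getTerm() { return null; }  // two terms, so no single answer
}

class Caller {
    static String describe(MTQuery q) {
        String t = q.getTerm();
        return t == null ? "(multi-term range)" : t;  // explicit null check
    }
}
```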


Thoughts?

--
- Mark

http://www.lucidimagination.com







Build failed in Hudson: Lucene-trunk #810

2009-04-27 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/810/changes

Changes:

[mikemccand] LUCENE-1615: remove some more deprecated uses of Fieldable.omitTf

[mikemccand] remove redundant CHANGES entries from trunk if they are already 
covered in 2.4.1

--
[...truncated 2887 lines...]
compile-test:
 [echo] Building benchmark...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

compile-demo:

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

init:

clover.setup:

clover.info:

clover:

common.compile-core:

compile-core:

compile-demo:

compile-highlighter:
 [echo] Building highlighter...

build-memory:

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

init:

clover.setup:

clover.info:

clover:

common.compile-core:

compile-core:

compile:

check-files:

init:

clover.setup:

clover.info:

clover:

compile-core:

common.compile-test:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/test
 
[javac] Compiling 9 source files to 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/test
 
[javac] Note: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/benchmark/src/test/org/apache/lucene/benchmark/quality/TestQualityRun.java
  uses or overrides a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
 [copy] Copying 2 files to 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/test
 

build-artifacts-and-tests:
 [echo] Building collation...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

compile-misc:
 [echo] Building misc...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

init:

clover.setup:

clover.info:

clover:

compile-core:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/misc/classes/java
 
[javac] Compiling 16 source files to 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/misc/classes/java
 
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.

compile:

init:

clover.setup:

clover.info:

clover:

compile-core:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/collation/classes/java
 
[javac] Compiling 4 source files to 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/collation/classes/java
 
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.

jar-core:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/collation/lucene-collation-2.4-SNAPSHOT.jar
 

jar:

compile-test:
 [echo] Building collation...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

compile-misc:
 [echo] Building misc...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

init:

clover.setup:

clover.info:

clover:

compile-core:

compile:

init:

clover.setup:

clover.info:

clover:

compile-core:

common.compile-test:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/collation/classes/test
 
[javac] Compiling 5 source files to 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/collation/classes/test
 
[javac] Note: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/collation/src/test/org/apache/lucene/collation/CollationTestBase.java
  uses or overrides a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.

build-artifacts-and-tests:

bdb:
 [echo] Building bdb...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

contrib-build.init:

get-db-jar:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/db/bdb/lib
 
  [get] Getting: http://downloads.osafoundation.org/db/db-4.7.25.jar
  [get] To: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/db/bdb/lib/db-4.7.25.jar
 
  [get] Error getting http://downloads.osafoundation.org/db/db-4.7.25.jar 
to 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/db/bdb/lib/db-4.7.25.jar
 

BUILD FAILED
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build.xml:628: 
The following error occurred while 

Re: [Lucene-java Wiki] Update of LuceneAtApacheConUs2009 by MichaelBusch

2009-04-27 Thread Michael Busch
I'm happy to give more than one talk; on the other hand I don't want to 
prevent others from presenting. So if anyone would like to give similar talks 
to the ones I suggested, please let us know.


-Michael

On 4/27/09 10:07 PM, Apache Wiki wrote:

Dear Wiki user,

You have subscribed to a wiki page or wiki category on Lucene-java Wiki for 
change notification.

The following page has been changed by MichaelBusch:
http://wiki.apache.org/jakarta-lucene/LuceneAtApacheConUs2009

--
Let's wait to fill this in until Concom provides us a list from the regular 
CFP process.

   = Possible Talks or Tutorials =
-  * Lucene Basics (Michael Busch)
+  * Lucene Basics (Michael Busch or others?)
* Intro to Solr (:  Hoss out of the box talk?)
* Intro to Nutch and/or Nutch Vertical Search (Andrzej Bialecki) (when was 
the last time we had a Nutch talk? ''probably never...'')
* Mime Magic with Apache Tika (Jukka Zitting)
@@ -34, +34 @@




-  * New Features in Lucene (Michael Busch)
+  * New Features in Lucene (Michael Busch or others?)
* Advanced Lucene Indexing (Michael Busch)
* Building Intelligent Search Applications with the Lucene Ecosystem (Grant 
Ingersoll)  - see abstract at bottom
* Solr Operations and Performance Tuning

   






[jira] Commented: (LUCENE-1567) New flexible query parser

2009-04-27 Thread Bertrand Delacretaz (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703504#action_12703504
 ] 

Bertrand Delacretaz commented on LUCENE-1567:
-

Grant, the ip-clearance document that you created under incubator-public in svn 
had not been added to the site-publish folder, I just did that in revision 
769253. If that's not correct, please remove both xml and html versions of the 
lucene-query-parser file there.

 New flexible query parser
 -

 Key: LUCENE-1567
 URL: https://issues.apache.org/jira/browse/LUCENE-1567
 Project: Lucene - Java
  Issue Type: New Feature
  Components: QueryParser
 Environment: N/A
Reporter: Luis Alves
Assignee: Grant Ingersoll
 Attachments: lucene_trunk_FlexQueryParser_2009March24.patch, 
 lucene_trunk_FlexQueryParser_2009March26_v3.patch


 From the New flexible query parser thread by Michael Busch:
 in my team at IBM we have used a different query parser than Lucene's in
 our products for quite a while. Recently we spent a significant amount
 of time in refactoring the code and designing a very generic
 architecture, so that this query parser can be easily used for different
 products with varying query syntaxes.
 This work was originally driven by Andreas Neumann (who, however, left
 our team); most of the code was written by Luis Alves, who has been a
 bit active in Lucene in the past, and Adriano Campos, who joined our
 team at IBM half a year ago. Adriano is Apache committer and PMC member
 on the Tuscany project and getting familiar with Lucene now too.
 We think this code is much more flexible and extensible than the current
 Lucene query parser, and would therefore like to contribute it to
 Lucene. I'd like to give a very brief architecture overview here,
 Adriano and Luis can then answer more detailed questions as they're much
 more familiar with the code than I am.
 The goal was it to separate syntax and semantics of a query. E.g. 'a AND
 b', '+a +b', 'AND(a,b)' could be different syntaxes for the same query.
 We distinguish the semantics of the different query components, e.g.
 whether and how to tokenize/lemmatize/normalize the different terms or
 which Query objects to create for the terms. We wanted to be able to
 write a parser with a new syntax, while reusing the underlying
 semantics, as quickly as possible.
 In fact, Adriano is currently working on a 100% Lucene-syntax compatible
 implementation to make it easy for people who are using Lucene's query
 parser to switch.
 The query parser has three layers and its core is what we call the
 QueryNodeTree. It is a tree that initially represents the syntax of the
 original query, e.g. for 'a AND b':
   AND
  /   \
 A B
 The three layers are:
 1. QueryParser
 2. QueryNodeProcessor
 3. QueryBuilder
 1. The upper layer is the parsing layer which simply transforms the
 query text string into a QueryNodeTree. Currently our implementations of
 this layer use javacc.
 2. The query node processors do most of the work. It is in fact a
 configurable chain of processors. Each processors can walk the tree and
 modify nodes or even the tree's structure. That makes it possible to
 e.g. do query optimization before the query is executed or to tokenize
 terms.
 3. The third layer is also a configurable chain of builders, which
 transform the QueryNodeTree into Lucene Query objects.
 Furthermore the query parser uses flexible configuration objects, which
 are based on AttributeSource/Attribute. It also uses message classes that
 allow to attach resource bundles. This makes it possible to translate
 messages, which is an important feature of a query parser.
 This design allows us to develop different query syntaxes very quickly.
 Adriano wrote the Lucene-compatible syntax in a matter of hours, and the
 underlying processors and builders in a few days. We now have a 100%
 compatible Lucene query parser, which means the syntax is identical and
 all query parser test cases pass on the new one too using a wrapper.
 Recent posts show that there is demand for query syntax improvements,
 e.g. improved range query syntax or operator precedence. There are 
 already different QP implementations in Lucene+contrib, however I think
 we did not keep them all up to date and in sync. This is not too
 surprising, because usually when fixes and changes are made to the main
 query parser, people don't make the corresponding changes in the contrib
 parsers. (I'm guilty here too)
 With this new architecture it will be much easier to maintain different
 query syntaxes, as the actual code for the first layer is not very much.
 All syntaxes would benefit from patches and improvements we make to the
 underlying layers, which will make supporting different syntaxes much
 more manageable.
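A toy model of the three layers described above (illustrative names only, not the contributed code): a parser builds a tree, a configurable chain of processors rewrites it, and a builder maps the result to the target query object; each layer is swappable on its own.

```java
import java.util.List;
import java.util.function.UnaryOperator;

class QueryPipeline {
    // Layer 1: syntax -> tree (here: "a AND b" becomes the prefix form "AND(a,b)";
    // a real parser would produce a QueryNodeTree, e.g. via javacc).
    static String parse(String text) {
        String[] parts = text.split(" AND ");
        return parts.length == 2 ? "AND(" + parts[0] + "," + parts[1] + ")" : text;
    }

    // Layer 2: configurable processor chain; each step may rewrite the tree
    // (query optimization, tokenization, normalization, ...).
    static String process(String tree, List<UnaryOperator<String>> processors) {
        for (UnaryOperator<String> p : processors) {
            tree = p.apply(tree);
        }
        return tree;
    }

    // Layer 3: configurable builder chain; tree -> backend query object
    // (here just a tagged string standing in for a Lucene Query).
    static String build(String tree) { return "Query[" + tree + "]"; }
}
```

A different syntax only needs a new layer-1 parser; the processors and builders underneath are reused, which is exactly the maintenance argument made above.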
