[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703034#action_12703034 ]

Michael McCandless commented on LUCENE-1616:
--------------------------------------------

Should we deprecate the separate setters with this addition?

> add one setter for start and end offset to OffsetAttribute
> ----------------------------------------------------------
>
>            Key: LUCENE-1616
>            URL: https://issues.apache.org/jira/browse/LUCENE-1616
>        Project: Lucene - Java
>     Issue Type: Improvement
>     Components: Analysis
>       Reporter: Eks Dev
>       Priority: Trivial
>        Fix For: 2.9
>    Attachments: LUCENE-1616.patch
>
> add OffsetAttribute.setOffset(startOffset, endOffset); trivial change, no JUnit needed
> Changed CharTokenizer to use it

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
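The combined setter the patch proposes is simple enough to sketch. The class below is a hypothetical, simplified stand-in for Lucene's OffsetAttribute implementation; the method names follow the issue, the rest is assumed:

```java
// Hypothetical, minimal stand-in for Lucene's OffsetAttribute implementation.
// setOffset(int, int) is the addition proposed in LUCENE-1616; the separate
// setters are the ones the thread debates deprecating.
public class OffsetAttributeSketch {
    private int startOffset;
    private int endOffset;

    public void setStartOffset(int startOffset) { this.startOffset = startOffset; }
    public void setEndOffset(int endOffset) { this.endOffset = endOffset; }

    // Combined setter: both ends of the offset are updated in one call,
    // so a caller can never forget one half of the pair.
    public void setOffset(int startOffset, int endOffset) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    public int startOffset() { return startOffset; }
    public int endOffset() { return endOffset; }
}
```

A tokenizer would then emit something like `attr.setOffset(tokenStart, tokenStart + tokenLength)` in a single call, which is the change the patch makes in CharTokenizer.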
[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703038#action_12703038 ]

Uwe Schindler commented on LUCENE-1616:
---------------------------------------

Not really: the attributes API was added for 2.9, so it has not yet appeared in an official release; the separate setters could simply be removed.
Re: CHANGES.txt
OK, I fixed CHANGES.txt to not have double entries for the same issue in 2.4.1 and trunk (ie, the entry is only in 2.4.1's CHANGES section). And going forward, if a trunk issue gets backported to a point release, we should de-dup the entries on releasing the point release. Ie, before the point release is out, trunk can contain XXX as well as the branch for the point release, but on copying back the branch's CHANGES entries, we de-dup them. I'll update ReleaseTodo in the wiki. Thanks Steven!

Mike

On Sat, Apr 25, 2009 at 5:43 AM, Michael McCandless luc...@mikemccandless.com wrote:

On Fri, Apr 24, 2009 at 7:17 PM, Steven A Rowe sar...@syr.edu wrote:

Maybe even tiny bug fixes should always be called out in trunk's CHANGES. Or, maybe a tiny bug fix that also gets backported to a point release must then be called out in both places? I think I prefer the 2nd.

The difference between these two options is that, in the 2nd, tiny bug fixes are mentioned in trunk's CHANGES only if they are backported to a point release, right?

For the record, the previous policy (the zeroth option :) appears to be that backported bug fixes, regardless of size, are mentioned only once, in the CHANGES for the (chronologically) first release in which they appeared. You appear to oppose this policy because (paraphrasing) people would wonder whether point-release fixes were also fixed in following major/minor releases. IMNSHO, however, people (sometimes erroneously) view product releases as genetically linear: naming a release A.(B)[.x] implies inclusion of all changes to any release A.B[.y]. I.e., my sense is quite the opposite of yours: I would be *shocked* if bug fixes included in version 2.4.1 were not included (or explicitly called out as not included) in version 2.9.0.
If more than one point release branch is active at any one time, then things get more complicated (genetic linearity can no longer be assumed), and your new policy seems like a reasonable attempt at managing the complexity. But will Lucene ever have more than one active bugfix branch? It never has before.

But maybe I'm not understanding your intent: are you distinguishing between released CHANGES and unreleased CHANGES? That is, do you intend to apply this new policy only to the unreleased trunk CHANGES, but then remove the redundant bugfix notices once a release is performed?

OK, you've convinced me (to go back to the 0th policy)! Users can and should assume, on seeing a point release containing XXX, that all future releases also include XXX. Ie, CHANGES should not be a vehicle for confirming that this is what happened. So if XXX is committed to trunk and got a CHANGES entry, and at a later time it's backported to a point release, I will remove XXX from the trunk CHANGES and put it *only* in the point release's CHANGES.

Also, I'll go and fix CHANGES to remove the trunk entries when there's a point-release entry, if nobody objects in the next day or so.

Mike
[jira] Resolved: (LUCENE-1615) deprecated method used in fieldsReader / setOmitTf()
[ https://issues.apache.org/jira/browse/LUCENE-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1615.
----------------------------------------
    Resolution: Fixed

OK, I just committed this -- thanks Eks!

> deprecated method used in fieldsReader / setOmitTf()
> ----------------------------------------------------
>
>            Key: LUCENE-1615
>            URL: https://issues.apache.org/jira/browse/LUCENE-1615
>        Project: Lucene - Java
>     Issue Type: Improvement
>     Components: Index
>       Reporter: Eks Dev
>       Priority: Trivial
>    Attachments: LUCENE-1615.patch
>
> setOmitTf(boolean) is deprecated and should not be used by core classes. One place where it appears is FieldsReader; this patch fixes it. It was necessary to change Fieldable to AbstractField in two places, only local variables.
[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703041#action_12703041 ]

Michael McCandless commented on LUCENE-1616:
--------------------------------------------

Oh yeah :) Good! I'm losing track of what's not yet released... Eks, can you update the patch with that? Thanks.
[jira] Updated: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1616:
---------------------------------------
    Fix Version/s: 2.9
[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703054#action_12703054 ]

Earwin Burrfoot commented on LUCENE-1616:
-----------------------------------------

Separate setters might have their own use? I believe I had a pair of filters that set begin and end offset in different parts of the code.
[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703062#action_12703062 ]

Michael McCandless commented on LUCENE-1616:
--------------------------------------------

But surely that's a very rare case (the exception, not the rule). Ie, nearly always, one sets start and end offset together?
Re: Adding testpackage to common-build.xml
This sounds like a great change! It would also allow us to test the other (index, store, etc.) packages too? I don't think this is possible today, though I'm no expert with ant, so it's entirely possible I've missed it. Presumably once we modularize, the module would be the natural unit for testing (but it seems like this will be a ways off...).

Mike

On Mon, Apr 27, 2009 at 7:01 AM, Shai Erera ser...@gmail.com wrote:
> Hi
>
> I noticed that one can define testcase to execute just one test class, which is convenient. However, I didn't notice any equivalent for testing a whole package. Lately, when handling all those search issues, I often wanted to run all the tests in o.a.l.search and just them, but couldn't, so I either ran a single test class when it was obvious (like TestSort) or test-core when it was less obvious (like changes to Collector or BooleanScorer). I wrote a simple patch which adds this capability to common-build.xml. I would like to confirm first that you agree to add this change, and that I didn't miss it and this capability exists elsewhere already.
>
> Shai
Re: Adding testpackage to common-build.xml
ok then I'll open an issue and post the patch. You can review it and give it a try

Shai

On Mon, Apr 27, 2009 at 2:10 PM, Michael McCandless luc...@mikemccandless.com wrote:
> This sounds like a great change! It would also allow us to test the other (index, store, etc.) packages too? [...]
[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703067#action_12703067 ]

Earwin Burrfoot commented on LUCENE-1616:
-----------------------------------------

I have two cases. In one case I can't access the start offset by the time I set the end offset, and therefore have to introduce a field on the filter for keeping track of it (or use the next case's solution twice) if separate setters are removed. In the other case I only need to adjust the end offset, so I'll have to do attr.setOffset(attr.getStartOffset(), newEndOffset). Nothing deadly, but I don't see the point of removing methods that might be useful and don't interfere with anything.
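Earwin's second case, adjusting only the end offset when just the combined setter exists, can be sketched as below. The class is a hypothetical pair holder, not Lucene's actual attribute API:

```java
// Hypothetical sketch of adjusting only the end offset when just the
// combined setter exists: the unchanged start must be read back and
// re-supplied, as in attr.setOffset(attr.getStartOffset(), newEndOffset).
public class OffsetPair {
    private int start;
    private int end;

    public void setOffset(int start, int end) {
        this.start = start;
        this.end = end;
    }

    public int startOffset() { return start; }
    public int endOffset() { return end; }

    // The workaround Earwin describes, wrapped in a helper.
    public void adjustEnd(int newEnd) {
        setOffset(startOffset(), newEnd);
    }
}
```

Nothing is lost functionally; the debate is only about whether re-reading the start is an acceptable cost for making the common case atomic.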
Adding testpackage to common-build.xml
Hi

I noticed that one can define testcase to execute just one test class, which is convenient. However, I didn't notice any equivalent for testing a whole package. Lately, when handling all those search issues, I often wanted to run all the tests in o.a.l.search and just them, but couldn't, so I either ran a single test class when it was obvious (like TestSort) or test-core when it was less obvious (like changes to Collector or BooleanScorer). I wrote a simple patch which adds this capability to common-build.xml. I would like to confirm first that you agree to add this change, and that I didn't miss it and this capability exists elsewhere already.

Shai
[jira] Updated: (LUCENE-1617) Add testpackage to common-build.xml
[ https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1617:
-------------------------------
    Attachment: LUCENE-1617.patch

> Add testpackage to common-build.xml
> -----------------------------------
>
>            Key: LUCENE-1617
>            URL: https://issues.apache.org/jira/browse/LUCENE-1617
>        Project: Lucene - Java
>     Issue Type: Improvement
>     Components: Build
>       Reporter: Shai Erera
>       Priority: Minor
>        Fix For: 2.9
>    Attachments: LUCENE-1617.patch
>
> One can define testcase to execute just one test class, which is convenient. However, I didn't notice any equivalent for testing a whole package. I find it convenient to be able to test packages rather than test cases because often it is not so clear which test class to run. The following patch allows one to run ant test -Dtestpackage=search (for example) and run all tests under the \*/search/\* packages in core, contrib and tags; or run ant test-core -Dtestpackage=search and execute similarly just for core; or run ant test-core -Dtestpackage=lucene/search/function and run all the tests under \*/lucene/search/function/\* (just in case there is another o.a.l.something.search.function package out there which we want to exclude).
[jira] Created: (LUCENE-1617) Add testpackage to common-build.xml
Add testpackage to common-build.xml
-----------------------------------

                Key: LUCENE-1617
                URL: https://issues.apache.org/jira/browse/LUCENE-1617
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Build
           Reporter: Shai Erera
           Priority: Minor
            Fix For: 2.9
        Attachments: LUCENE-1617.patch

One can define testcase to execute just one test class, which is convenient. However, I didn't notice any equivalent for testing a whole package. I find it convenient to be able to test packages rather than test cases because often it is not so clear which test class to run. The following patch allows one to run ant test -Dtestpackage=search (for example) and run all tests under the \*/search/\* packages in core, contrib and tags; or run ant test-core -Dtestpackage=search and execute similarly just for core; or run ant test-core -Dtestpackage=lucene/search/function and run all the tests under \*/lucene/search/function/\* (just in case there is another o.a.l.something.search.function package out there which we want to exclude).
[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703085#action_12703085 ]

Eks Dev commented on LUCENE-1616:
---------------------------------

I am OK with both options; removing the separate setters looks a bit better to me, as it forces users to think atomically about offset = {start, end}. If you separate start and end offset too far in your code, the probability that you miss a mistake somewhere is higher, compared to the case where you manage start and end on your own (which is then rather explicit in your code)... But that is all really something we should not think too much about :) We make no mistakes either way. I can provide a new patch, if needed.
[jira] Assigned: (LUCENE-1604) Stop creating huge arrays to represent the absence of field norms
[ https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-1604:
------------------------------------------
    Assignee: Michael McCandless

> Stop creating huge arrays to represent the absence of field norms
> -----------------------------------------------------------------
>
>            Key: LUCENE-1604
>            URL: https://issues.apache.org/jira/browse/LUCENE-1604
>        Project: Lucene - Java
>     Issue Type: Improvement
>     Components: Index
> Affects Versions: 2.9
>       Reporter: Shon Vella
>       Assignee: Michael McCandless
>       Priority: Minor
>        Fix For: 2.9
>    Attachments: LUCENE-1604.patch, LUCENE-1604.patch, LUCENE-1604.patch
>
> Creating and keeping around huge arrays that hold a constant value is very inefficient, both from a heap usage standpoint and from a locality of reference standpoint. It would be much more efficient to use null to represent a missing norms table.
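The caller-side pattern the issue proposes can be sketched as follows. The names are illustrative, not Lucene's actual SegmentReader or Similarity internals:

```java
// Illustrative sketch of LUCENE-1604's idea: represent "this field has no
// norms" as null rather than allocating a maxDoc-sized array of a constant.
public class NormsSketch {
    static final float DEFAULT_NORM = 1.0f;

    // A null norms array means every document gets the default norm,
    // so no maxDoc-sized fake array is ever allocated.
    public static float normFor(byte[] norms, int doc) {
        return norms == null ? DEFAULT_NORM : decode(norms[doc]);
    }

    // Placeholder decode; Lucene's real byte-to-float decoding lives in
    // Similarity and is not reproduced here.
    static float decode(byte b) {
        return (b & 0xFF) / 255f;
    }
}
```

The cost is that every consumer of norms(field) must now handle a null return, which is exactly what the later patches fix in the unit tests and contrib modules.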
[jira] Commented: (LUCENE-1604) Stop creating huge arrays to represent the absence of field norms
[ https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703101#action_12703101 ]

Michael McCandless commented on LUCENE-1604:
--------------------------------------------

I tested this change on a Wikipedia index, with query 1, on a field that has norms. On Linux, JDK 1.6.0_13, I can see no performance difference (both get 7.2 qps, best of 10 runs). On Mac OS X 10.5.6, I see some difference (13.0 vs 12.3, best of 10 runs), but given the quirkiness I've seen in OS X's results not matching other platforms, I think we can disregard this. Also, given the performance gain one sees when norms are disabled, I think this is net/net a good change. We'll leave the default as false (for back compat), but this setting is deprecated, with a comment that in 3.0 it hardwires to true.
[jira] Commented: (LUCENE-1604) Stop creating huge arrays to represent the absence of field norms
[ https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703103#action_12703103 ]

Michael McCandless commented on LUCENE-1604:
--------------------------------------------

New patch attached:

* Fixed contrib/instantiated and contrib/misc to pass if I change the default for disableFakeNorms to true (which we will hardwire in 3.0)
* Tweaked javadocs
* Removed unused imports
* Added CHANGES.txt entry

I still need to review the rest of the patch... With this patch, all tests pass with the default set to false (back-compat). If I temporarily set it to true, all tests now pass except back-compat (which is expected and fine). I had started down the path of having contrib/instantiated respect the disableFakeNorms setting, but rapidly came to realize how little I understand contrib/instantiated's code ;) So I fell back to fixing the unit tests to accept null returns from the normal IndexReader.norms(...).
[jira] Updated: (LUCENE-1604) Stop creating huge arrays to represent the absence of field norms
[ https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1604:
---------------------------------------
    Attachment: LUCENE-1604.patch

Attached patch. I also added assert !getDisableFakeNorms(); inside SegmentReader.fakeNorms().
[jira] Updated: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1593:
-------------------------------
    Attachment: PerfTest.java
                LUCENE-1593.patch

The patch implements all that has been suggested except:

* Pre-populating the queue in TopFieldCollector: as was noted here previously, this seems to remove the 'if (queueFull)' check but add another 'if' in FieldComparator (which may be executed several times per collect()).
* Moving initCountingSumScorer() to BS2's ctor and add(): that's because if more than one Scorer is added, we create a DisjunctionSumScorer, which initializes its queue by calling next() on the passed-in Scorer. Therefore, if we called initCountingSumScorer for every Scorer added, we would advance that Scorer as well as all the previous ones. I chose to discard that optimization, which only affects next() and skipTo().

The patch also includes the fix for TestSort in the 2.4 back_compat branch. I only fixed TestSort, and not MultiSearcher and ParallelMultiSearcher. All tests pass.

I also ran some performance measurements (all on SRV 2003):

|| JRE || sort || best time (trunk) || best time (patch) || diff (%) ||
| SUN 1.6 | int | 1017.59 | 1015.96 | {color:green}~1%{color} |
| SUN 1.6 | doc | 767.49 | 763.20 | {color:green}~1%{color} |
| IBM 1.5 | int | 1018.77 | 1017.39 | {color:green}~1%{color} |
| IBM 1.5 | doc | 768.10 | 764.14 | {color:green}~1%{color} |

As you can see, there is a slight performance improvement, but nothing too dramatic. You are welcome to review the patch as well as run the PerfTest I attached. It accepts two arguments: indexDir and [sort]. 'sort' is optional; if not defined, it sorts by doc, otherwise (whatever value you pass there) it sorts by int.
Optimizations to TopScoreDocCollector and TopFieldCollector
-----------------------------------------------------------

                Key: LUCENE-1593
                URL: https://issues.apache.org/jira/browse/LUCENE-1593
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Search
           Reporter: Shai Erera
            Fix For: 2.9
        Attachments: LUCENE-1593.patch, PerfTest.java

This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is:

# Ensure that IndexSearcher returns segments in increasing doc Id order, instead of numDocs().
# Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete.
# Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the check if reusableSD == null.
# Also move to changing top and then calling adjustTop(), in case we update the queue.
# Some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker for the last SortField. But doing so should not be necessary (since we already break ties by docID), and is in fact less efficient (once the above optimization is in).
# Investigate PQ: can we deprecate insert() and have only insertWithOverflow()? Add an addDummyObjects method which populates the queue without arranging it, just storing the objects in the array (this can be used to pre-populate sentinel values)?

I will post a patch as well as some perf measurements as soon as I have them.
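Item #3 above, pre-populating the queue with sentinel scores, can be sketched like this. The sketch uses java.util.PriorityQueue rather than Lucene's HitQueue, and the class and method names are illustrative only:

```java
import java.util.PriorityQueue;

// Sketch of pre-populating a bounded hit queue with sentinel scores
// (Float.NEGATIVE_INFINITY) so collect() never has to check whether the
// queue is full. Uses java.util.PriorityQueue, not Lucene's HitQueue.
public class SentinelQueueSketch {
    // Min-heap: the head is always the worst score currently kept.
    private final PriorityQueue<Float> pq = new PriorityQueue<>();

    public SentinelQueueSketch(int numHits) {
        // Sentinels lose to any real score, so the "is the queue full?"
        // branch disappears from the per-hit path.
        for (int i = 0; i < numHits; i++) {
            pq.add(Float.NEGATIVE_INFINITY);
        }
    }

    public void collect(float score) {
        if (score > pq.peek()) { // the only test left on the hot path
            pq.poll();           // evict the worst entry (possibly a sentinel)
            pq.add(score);
        }
    }

    public float worstScore() {
        return pq.peek();
    }
}
```

After collection, any remaining Float.NEGATIVE_INFINITY entries simply mean fewer than numHits documents matched and are skipped when building the result set.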
[jira] Commented: (LUCENE-1604) Stop creating huge arrays to represent the absence of field norms
[ https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703142#action_12703142 ]

Michael McCandless commented on LUCENE-1604:
--------------------------------------------

OK, patch looks good! I plan to commit in a day or two. Thanks Shon!
[jira] Commented: (LUCENE-1617) Add testpackage to common-build.xml
[ https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703156#action_12703156 ]

Michael McCandless commented on LUCENE-1617:
--------------------------------------------

I would like to be able to run them (one use case for this would be to parallelize tests -- I do this now (Python script) by running test-core/test-tag/test-contrib in parallel, but it's mis-balanced because contrib finishes so quickly). How about -Dtestrootonly=true?
[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703157#action_12703157 ] Earwin Burrfoot commented on LUCENE-1616: - bq. removing separate looks a bit better for me as it forces users to think atomic about offset = {start, end}. And if it's not atomic by design? bq. If you separate start and end offset too far in your code, probability that you do not see a mistake somewhere is higher compared to the case where you manage start and end on your own in these cases (this is then rather explicit in your code)... Instead of having one field for Term, which you build incrementally, you now have to keep another field for startOffset. Imho, that's starting to cross into another meaning of 'explicit' :) And while you're trying to prevent bugs of using setStartOffset and forgetting about its 'End' counterpart, you introduce another set of bugs - overwriting one end of the interval when you only need to update the other. bq. And in general I prefer one clear way to do something And force everyone who has a slightly different use-case to jump through the hoops. Span*Query api is a perfect example. Well, whatever. add one setter for start and end offset to OffsetAttribute -- Key: LUCENE-1616 URL: https://issues.apache.org/jira/browse/LUCENE-1616 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Eks Dev Priority: Trivial Fix For: 2.9 Attachments: LUCENE-1616.patch add OffsetAttribute.setOffset(startOffset, endOffset); trivial change, no JUnit needed Changed CharTokenizer to use it
Re: [jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
Ok, I'll create another patch a bit later today - Original Message From: Michael McCandless (JIRA) j...@apache.org To: java-dev@lucene.apache.org Sent: Monday, 27 April, 2009 16:34:30 Subject: [jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute [ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703144#action_12703144 ] Michael McCandless commented on LUCENE-1616: bq. removing separate looks a bit better for me as it forces users to think atomic about offset = {start, end}. This is my thinking as well. And in general I prefer one clear way to do something (the Python way) instead of providing various different ways to do the same thing (the Perl way).
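The combined setter agreed on here can be sketched as a minimal standalone class. This is an illustrative sketch, not the actual Lucene OffsetAttribute source; only the method name setOffset(startOffset, endOffset) comes from the issue itself:

```java
// Illustrative sketch only -- not the actual Lucene source. It mirrors the
// combined setter proposed in LUCENE-1616: offset = {start, end} is set
// atomically, so the two halves of the interval cannot drift apart.
public class OffsetAttributeSketch {
    private int startOffset;
    private int endOffset;

    // The single setter replacing separate setStartOffset(int)/setEndOffset(int).
    public void setOffset(int startOffset, int endOffset) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    public int startOffset() { return startOffset; }
    public int endOffset() { return endOffset; }

    public static void main(String[] args) {
        OffsetAttributeSketch att = new OffsetAttributeSketch();
        att.setOffset(0, 5); // e.g. a 5-character token starting at position 0
        System.out.println(att.startOffset() + "-" + att.endOffset());
    }
}
```

The point of the single method is exactly the one made above: a tokenizer cannot update one end of the interval and forget the other.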
[jira] Updated: (LUCENE-1617) Add testpackage to common-build.xml
[ https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1617: --- Attachment: LUCENE-1617.patch Added another property testpackageroot. So now you can define: * testcase - for a single test class * testpackage - for all classes in a package, including sub-packages * testpackageroot - for all classes in a package, without sub-packages But something is strange ... if I run ant test-core it works ok. If I run ant test-core -Dtestpackage=lucene a few classes fail, like AnalysisTest, IndexTest etc. (those that end with Test). That's because they are not TestCases ... I wonder why in ant test-core those files are skipped (and I see they are not executed) but in testpackage they are not. Anyway, I'll look into it later, unless someone who is more knowledgeable in Ant wants to look at it. This is not ready to be committed, as ant test-core -Dtestpackage=lucene and ant test-core -Dtestpackageroot=lucene fail on those non-test-case files.
RE: CHANGES.txt
Thank you, Mike, for working to make things better. Steve On 4/27/2009 at 5:32 AM, Michael McCandless wrote: OK I fixed CHANGES.txt to not have double entries for the same issue in 2.4.1 and trunk (ie, the entry is only in 2.4.1's CHANGES section). And going forward, if a trunk issue gets backported to a point release, we should de-dup the entries on releasing the point release. Ie, before the point release is released, trunk can contain XXX as well as the branch for the point release, but on copying back the branch's CHANGES entries, we de-dup them. I'll update ReleaseTodo in the wiki. Thanks Steven! Mike On Sat, Apr 25, 2009 at 5:43 AM, Michael McCandless luc...@mikemccandless.com wrote: On Fri, Apr 24, 2009 at 7:17 PM, Steven A Rowe sar...@syr.edu wrote: Maybe even tiny bug fixes should always be called out on trunk's CHANGES. Or, maybe a tiny bug fix that also gets backported to a point release, must then be called out in both places? I think I prefer the 2nd. The difference between these two options is that in the 2nd, tiny bug fixes are mentioned in trunk's CHANGES only if they are backported to a point release, right? For the record, the previous policy (the zeroth option :) appears to be that backported bug fixes, regardless of size, are mentioned only once, in the CHANGES for the (chronologically) first release in which they appeared. You appear to oppose this policy, because (paraphrasing): people would wonder whether point release fixes were also fixed on following major/minor releases. IMNSHO, however, people (sometimes erroneously) view product releases as genetically linear: naming a release A.(B)[.x] implies inclusion of all changes to any release A.B[.y]. I.e., my sense is quite the opposite of yours: I would be *shocked* if bug fixes included in version 2.4.1 were not included (or explicitly called out as not included) in version 2.9.0.
If more than one point release branch is active at any one time, then things get more complicated (genetic linearity can no longer be assumed), and your new policy seems like a reasonable attempt at managing the complexity. But will Lucene ever have more than one active bugfix branch? It never has before. But maybe I'm not understanding your intent: are you distinguishing between released CHANGES and unreleased CHANGES? That is, do you intend to apply this new policy only to the unreleased trunk CHANGES, but then remove the redundant bugfix notices once a release is performed? OK you've convinced me (to go back to the 0th policy)! Users can and should assume on seeing a point release containing XXX that all future releases also include XXX. Ie, CHANGES should not be a vehicle for confirming that this is what happened. So if XXX is committed to trunk and got a CHANGES entry, if it a later time it's back ported to a point release, I will remove the XXX from the trunk CHANGES and put it *only* on the point releases CHANGES. Also, I'll go and fix CHANGES, to remove the trunk entries when there's a point-release entry, if nobody objects in the next day or so. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703182#action_12703182 ] Michael McCandless commented on LUCENE-1616: bq. And force everyone who has slightly different use-case to jump through the hoops. Simple things should be simple and complex things should be possible is a strong guide when I'm thinking about APIs, configuration, etc. My feeling here is that for the vast majority of cases, people set start and end offset together, so we should shift to the API that makes that easy. This is the simple case. For the remaining minority (your interesting use case), you can still do what you need but yes there are some hoops to go through. This is the complex case. bq. Span*Query api is a perfect example. Can you describe the limitations here in more detail?
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703197#action_12703197 ] Michael McCandless commented on LUCENE-1593: The patch still has various logic to handle the sentinel values, but we backed away from that optimization (it's not generally safe)? Also, I fear we need to conditionalize the "don't need to break ties by docID" optimization, because BooleanScorer doesn't visit docs in order? bq. I chose to discard that optimization, which only affects next() and skipTo(). Maybe we should add a start() method to Scorer, to handle initializations like this, so that next() doesn't have to check every time? Optimizations to TopScoreDocCollector and TopFieldCollector --- Key: LUCENE-1593 URL: https://issues.apache.org/jira/browse/LUCENE-1593 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 Attachments: LUCENE-1593.patch, PerfTest.java This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is: # Ensure that IndexSearcher returns segments in increasing doc Id order, instead of numDocs(). # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete. # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the check if reusableSD == null. # Also move to use changing top and then call adjustTop(), in case we update the queue. # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker for the last SortField. But, doing so should not be necessary (since we already break ties by docID), and is in fact less efficient (once the above optimization is in). # Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? 
Add an addDummyObjects method which will populate the queue without arranging it, just store the objects in the array (this can be used to pre-populate sentinel values)? I will post a patch as well as some perf measurements as soon as I have them.
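Item 3 of the plan above (pre-populating HitQueue with sentinel values so the null check disappears) can be sketched with a plain java.util.PriorityQueue. SentinelTopK and its methods are illustrative names, not Lucene classes:

```java
import java.util.PriorityQueue;

// Illustrative sketch of the sentinel idea from LUCENE-1593, using
// java.util.PriorityQueue rather than Lucene's HitQueue: pre-filling a
// size-k min-queue with sentinel scores (Float.NEGATIVE_INFINITY) removes
// the null/size check from the hot path -- every new hit simply competes
// against the current queue top.
public class SentinelTopK {
    private final PriorityQueue<Float> pq = new PriorityQueue<>(); // min-heap

    public SentinelTopK(int k) {
        for (int i = 0; i < k; i++) {
            pq.add(Float.NEGATIVE_INFINITY); // sentinel: loses to any real score
        }
    }

    // No "is the queue full yet?" branch: the sentinels guarantee it is.
    public void collect(float score) {
        if (score > pq.peek()) {
            pq.poll();
            pq.add(score);
        }
    }

    public float top() { return pq.peek(); } // smallest competitive score

    public static void main(String[] args) {
        SentinelTopK top = new SentinelTopK(2);
        top.collect(1.5f);
        top.collect(3.0f);
        top.collect(0.5f); // loses to the current top and is discarded
        System.out.println(top.top());
    }
}
```

Note the caveat raised in the comments: the sentinel trick assumes real scores always beat the sentinel, which is why the thread discusses whether it is generally safe.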
Re: perf enhancement and lucene-1345
I think you mean this thread: http://markmail.org/message/idgcnxmbyo3yjdiw right? I would love to see these in Lucene... P4Delta, which Paul has started under LUCENE-1410, is clearly a win, but is a biggish change to Lucene since all offsets would need to become blockID + offsetWithinBlock. LUCENE-1458 (further steps flexible indexing) tries to make things generic enough that P4Delta can simply be a different codec. On the logic operators for combining DocIDSets... how do these differ from what we already do in BooleanScorer[2]? (I haven't had a chance to get a good look at Kamikaze yet). Mike On Fri, Apr 24, 2009 at 11:34 PM, John Wang john.w...@gmail.com wrote: Hi Guys: A while ago I posted some enhancements to disjunction and conjunction docIdSetIterators that showed performance improvements to Lucene-1345. I think it got mixed up with another discussion on that issue. Was wondering what happened with it and what are the plans. Thanks -John - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703227#action_12703227 ] Michael McCandless commented on LUCENE-1616: Thanks Eks. You also need to fix all the places that call the old methods (things don't compile w/ the new patch).
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703228#action_12703228 ] Michael McCandless commented on LUCENE-1614: Shai did you forget to attach the patch here? Or maybe you're just busy ;) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean Key: LUCENE-1614 URL: https://issues.apache.org/jira/browse/LUCENE-1614 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 See http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html for the full discussion. The basic idea is to add variants of those two methods that return the current doc they are at, to save successive calls to doc(). If there are no more docs, return -1. A summary of what was discussed so far: # Deprecate those two methods. # Add nextDoc() and skipToDoc(int) that return the doc, with default impl in DISI (calls next() and skipTo() respectively, and will be changed to abstract in 3.0). #* I actually would like to propose an alternative to the names: advance() and advance(int) - the first advances by one, the second advances to target. # Wherever these are used, do something like '(doc = advance()) >= 0' instead of comparing to -1 for improved performance. I will post a patch shortly
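The proposed calling convention can be sketched with a toy iterator over a sorted int array. IntArrayDocIdIterator is a hypothetical stand-in, not a Lucene class; the -1 exhaustion sentinel and the return-the-doc contract come from the issue summary above:

```java
// Hypothetical stand-in for the proposed DocIdSetIterator variants in
// LUCENE-1614: both methods return the current doc id directly, or -1 when
// exhausted, so callers avoid a separate doc() call after every next()/skipTo().
public class IntArrayDocIdIterator {
    private final int[] docs; // sorted, non-negative doc ids
    private int pos = -1;

    public IntArrayDocIdIterator(int[] docs) { this.docs = docs; }

    // Advances by one; returns the new doc, or -1 if there are no more docs.
    public int nextDoc() {
        pos++;
        return pos < docs.length ? docs[pos] : -1;
    }

    // Advances to the first doc >= target; returns it, or -1 if none exists.
    public int advance(int target) {
        int doc;
        while ((doc = nextDoc()) >= 0 && doc < target) {
            // keep skipping
        }
        return doc;
    }

    public static void main(String[] args) {
        IntArrayDocIdIterator it = new IntArrayDocIdIterator(new int[] {2, 5, 9});
        int doc;
        // The '(doc = it.nextDoc()) >= 0' idiom from the issue summary.
        while ((doc = it.nextDoc()) >= 0) {
            System.out.println(doc);
        }
    }
}
```

The single-call loop is the whole point: one method call per step instead of a boolean-returning next() followed by doc().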
[jira] Updated: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1616: Attachment: LUCENE-1616.patch whoops, this time it compiles :)
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703233#action_12703233 ] Shai Erera commented on LUCENE-1614: No I did not forget - I need to work on it (trying to juggle all the issues I opened :)) ... in general I don't like to work on overlapping issues and this overlaps with 1593 (it will touch some of the same files). But I can start working on the patch - it looks much simpler than 1593 ... One thing I wanted to get feedback on is the proposal to use advance() and advance(target). Let's decide on that now, so that I don't need to refactor everything afterwards :)
[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703245#action_12703245 ] Michael McCandless commented on LUCENE-1616: I still get compilation errors: {code} [mkdir] Created dir: /lucene/src/lucene.offsets/build/classes/java [javac] Compiling 372 source files to /lucene/src/lucene.offsets/build/classes/java [javac] /lucene/src/lucene.offsets/src/java/org/apache/lucene/analysis/KeywordTokenizer.java:62: cannot find symbol [javac] symbol : method setStartOffset(int) [javac] location: class org.apache.lucene.analysis.tokenattributes.OffsetAttribute [javac] offsetAtt.setStartOffset(0); [javac]^ [javac] /lucene/src/lucene.offsets/src/java/org/apache/lucene/analysis/KeywordTokenizer.java:63: cannot find symbol [javac] symbol : method setEndOffset(int) [javac] location: class org.apache.lucene.analysis.tokenattributes.OffsetAttribute [javac] offsetAtt.setEndOffset(upto); [javac]^ [javac] /lucene/src/lucene.offsets/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java:164: cannot find symbol [javac] symbol : method setStartOffset(int) [javac] location: class org.apache.lucene.analysis.tokenattributes.OffsetAttribute [javac] offsetAtt.setStartOffset(start); [javac] ^ [javac] /lucene/src/lucene.offsets/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java:165: cannot find symbol [javac] symbol : method setEndOffset(int) [javac] location: class org.apache.lucene.analysis.tokenattributes.OffsetAttribute [javac] offsetAtt.setEndOffset(start+termAtt.termLength()); [javac] ^ [javac] /lucene/src/lucene.offsets/src/java/org/apache/lucene/index/DocInverterPerThread.java:56: cannot find symbol [javac] symbol : method setStartOffset(int) [javac] location: class org.apache.lucene.analysis.tokenattributes.OffsetAttribute [javac] offsetAttribute.setStartOffset(startOffset); [javac] ^ [javac] 
/lucene/src/lucene.offsets/src/java/org/apache/lucene/index/DocInverterPerThread.java:57: cannot find symbol [javac] symbol : method setEndOffset(int) [javac] location: class org.apache.lucene.analysis.tokenattributes.OffsetAttribute [javac] offsetAttribute.setEndOffset(endOffset); [javac] ^ [javac] Note: Some input files use or override a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. [javac] 6 errors {code}
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703251#action_12703251 ] Marvin Humphrey commented on LUCENE-1614: - advance() and advance(int) In the interest of coherent email exchanges, I think it would be best to give these methods distinct names, e.g. nudge and advance.
[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703254#action_12703254 ] Eks Dev commented on LUCENE-1616: - me too, sorry! Eclipse left me blind for some funny reason waiting for test to complete before I commit again ...
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703266#action_12703266 ] Shai Erera commented on LUCENE-1593: bq. The patch still has various logic to handle the sentinel values Are you talking about TSDC? I thought we agreed that initializing to Float.NEG_INF is reasonable for TSDC? If not, then I can remove it from there as well as the changes done to PQ. bq. Maybe we should add a start() method to Scorer Could be useful - but then we should probably do it on DocIdSetIterator with a default impl, and override where it makes sense (BS and BS2)? Also, if we do this, why not add an end() too, allowing a DISI to release resources? And if we document that calling next() and skipTo() without start() before that may result in unspecified behavior, it will resemble TermPositions somewhat, where you have to call next() before anything else. However, this should be done with caution. BS2 calls initCountingSumScorer in two places: (1) next() and skipTo() and (2) score(Collector). Now, in the latter, it first checks allowDocsOutOfOrder and if so initializes BS, adding the Scorers that were added in add(). However those Scorers *must not be initialized* prior to creating BS, since they will be advanced. So now it gets tricky - upon a call to start(), what should BS2 do? Check allowDocsOutOfOrder to determine whether to initialize or not? And what if it is true but score(Collector) will not be called, and instead next() and skipTo()? We should also protect against calling start() more than once, and in Scorers that aggregate several scorers, we should make sure their start() is called after all Scorers were added ... gets a bit complicated. What do you think? bq. Also, I fear we need to conditionalize the don't need to break ties by docID, because BooleanScorer doesn't visit docs in order? Yes I kept BS and BS2 in mind ... if we conditionalize anything, it means an extra 'if'. 
If we want to avoid that 'if', we need to create a variant of the class, which might not be so bad in TSDC, but will look awful in TFC (additional 6(?) classes). Perhaps we should still attempt to add to PQ if cmp == 0? Or did you have something else in mind?
[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703288#action_12703288 ] Earwin Burrfoot commented on LUCENE-1616: - bq. Span*Query api is a perfect example. bq. Can you describe the limitations here in more detail? Take a look at SpanNearQuery and SpanOrQuery. 1. They don't provide incremental construction (i.e. an add() method, like in BooleanQuery), and they can be built only from an array of subqueries. So, if you don't know the exact number of subqueries upfront, you're busted. You have to use an ArrayList, which you convert to an array to feed into SpanQuery, which is converted back to an ArrayList inside!! 2. They can't be edited. If you have a need to iterate over your query tree and modify it in one way or another, you need to create brand new instances of Span*Query. And here you hit #1 again, hard. 3. They can't even be inspected without creating a new array from the backing list (see getClauses). I use patched versions of SpanNear/OrQueries, which still use a backing ArrayList, but accept it in the constructor, have a utility 'add' method, and whose getClauses() returns this very list, which allows for zero-cost inspection and easy modification if the need arises.
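The construction style Earwin describes for his patched Span*Query classes can be sketched generically. IncrementalSpanNear is an illustrative stand-in, not the patched class itself; the real version would hold SpanQuery clauses:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not Lucene API) of the patched Span*Query construction
// style described above: a backing list, an incremental add(), and a
// getClauses() that returns the list itself so the query can be inspected
// and edited in place without array copies.
public class IncrementalSpanNear<C> {
    private final List<C> clauses;

    // Accepts (and keeps) the caller's list, as in the patched constructor.
    public IncrementalSpanNear(List<C> clauses) { this.clauses = clauses; }

    public IncrementalSpanNear() { this(new ArrayList<C>()); }

    // Utility add() for when the clause count isn't known upfront.
    public IncrementalSpanNear<C> add(C clause) {
        clauses.add(clause);
        return this;
    }

    // Zero-cost inspection: the backing list itself, not a defensive copy.
    public List<C> getClauses() { return clauses; }

    public static void main(String[] args) {
        IncrementalSpanNear<String> q = new IncrementalSpanNear<String>()
            .add("termA")
            .add("termB");
        q.getClauses().set(0, "termC"); // in-place edit, no query rebuild
        System.out.println(q.getClauses());
    }
}
```

The design trade-off is the usual one: exposing the backing list enables zero-cost inspection and editing, at the cost of the immutability that the array-based constructors enforce.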
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703291#action_12703291 ] Shai Erera commented on LUCENE-1614: nudge doesn't sound like it changes anything, but just touches. So if distinct method names are what we're after, I prefer nextDoc() and skipToDoc() or advance() for the latter. Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean Key: LUCENE-1614 URL: https://issues.apache.org/jira/browse/LUCENE-1614 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 See http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html for the full discussion. The basic idea is to add variants to those two methods that return the current doc they are at, to save successive calls to doc(). If there are no more docs, return -1. A summary of what was discussed so far: # Deprecate those two methods. # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI (calls next() and skipTo() respectively, and will be changed to abstract in 3.0). #* I actually would like to propose an alternative to the names: advance() and advance(int) - the first advances by one, the second advances to target. # Wherever these are used, do something like '(doc = advance()) >= 0' instead of comparing to -1 for improved performance. I will post a patch shortly -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
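The proposed contract can be sketched with a self-contained stand-in iterator (not the real DocIdSetIterator): nextDoc() and advance(int) return the doc they land on, with -1 once exhausted, so callers fold the old doc() call into the loop condition exactly as the summary suggests.

```java
// Stand-in for the proposed DocIdSetIterator API: iterate a sorted
// array of doc ids, returning the current doc from each movement
// method instead of a boolean. -1 is the exhausted sentinel, matching
// the proposal in this thread.
public class IntArrayDisi {
    private final int[] docs; // assumed sorted ascending
    private int pos = -1;

    public IntArrayDisi(int[] docs) {
        this.docs = docs;
    }

    // advance by one; returns the new doc, or -1 if exhausted
    public int nextDoc() {
        pos++;
        return pos < docs.length ? docs[pos] : -1;
    }

    // advance to the first doc >= target, or -1 if exhausted
    public int advance(int target) {
        int doc;
        while ((doc = nextDoc()) >= 0 && doc < target) {
            // keep skipping
        }
        return doc;
    }

    public static void main(String[] args) {
        IntArrayDisi it = new IntArrayDisi(new int[]{1, 4, 9, 12});
        int doc = it.advance(5);            // lands on 9
        StringBuilder seen = new StringBuilder().append(doc);
        // the proposed loop shape: no separate doc() call per step
        while ((doc = it.nextDoc()) >= 0) {
            seen.append(',').append(doc);
        }
        System.out.println(seen); // 9,12
    }
}
```

Note this follows the -1 convention discussed here; the names advance()/nextDoc() are the alternative proposed in the summary.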
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703315#action_12703315 ] Jason Rutherglen commented on LUCENE-1313: -- {quote} When we create SegmentWriteState (which is supposed to contain all details needed to tell DW how/where to write the segment), we'd set its directory to the RAMDir? That ought to be all that's needed (though, it's possible some places use a private copy of the original directory, which we should fix). DW should care less which Directory the segment is written to... {quote} Agreed that DW can write the segment to the RAMDir. I started coding along these lines however what do we do about the RAMDir merging? This is why I was thinking we'll need a separate IW? Otherwise the ram segments (if they are treated the same as disk segments) would quickly be merged to disk? Or we have two separate merging paths? If we have a disk IW and ram IW, I'm not sure how the docstores to disk part would work though I'm sure there's some way to do it. bq. modify resolveExternalSegments to accept a doMerge? Sounds good. Realtime Search --- Key: LUCENE-1313 URL: https://issues.apache.org/jira/browse/LUCENE-1313 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch Realtime search with transactional semantics. Possible future directions: * Optimistic concurrency * Replication Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot. I think this issue can hold realtime benchmarks which include indexing and searching concurrently. 
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703317#action_12703317 ] Jason Rutherglen commented on LUCENE-1313: -- {quote}we should make with NRT is to not close the doc store (stored fields, term vector) files when flushing for an NRT reader. {quote} Agreed, I think this feature is a must, otherwise we're doing unnecessary in-RAM merging. {quote}we'd need to be able to somehow share an IndexInput and IndexOutput; or, perhaps we can open an IndexInput even though an IndexOutput is still open{quote} I ran into problems with this before, when I was trying to reuse Directory to write a transaction log. It seemed theoretically doable, however it didn't work in practice. It could have been the seeking and replacing, but I don't remember. FSIndexOutput uses a writeable RAF and FSIndexInput is read only, so why would there be an issue? Realtime Search --- Key: LUCENE-1313 URL: https://issues.apache.org/jira/browse/LUCENE-1313 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch Realtime search with transactional semantics. Possible future directions: * Optimistic concurrency * Replication Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot. I think this issue can hold realtime benchmarks which include indexing and searching concurrently. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703327#action_12703327 ] Jason Rutherglen commented on LUCENE-1313: -- {quote}doc store files punch straight through to the real directory{quote} To implement this functionality in parallel (and perhaps make the overall patch cleaner), writing doc stores directly to a separate directory can be a different patch? There can be an option IW.setDocStoresDirectory(Directory) that the patch implements? Then some unit tests that are separate from the near realtime portion. Realtime Search --- Key: LUCENE-1313 URL: https://issues.apache.org/jira/browse/LUCENE-1313 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch Realtime search with transactional semantics. Possible future directions: * Optimistic concurrency * Replication Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot. I think this issue can hold realtime benchmarks which include indexing and searching concurrently. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1616: Attachment: LUCENE-1616.patch ok, maybe this time it will work, I hope I managed to clean it up (core build and tests pass). The only thing that fails is contrib, but I guess this has nothing to do with it? [javac] D:\Repository\SerachAndMatch\Lucene\lucene\java\trunk\contrib\highlighter\src\java\org\apache\lucene\search\highlight\WeightedSpanTermExtractor.java:306: cannot find symbol [javac] MemoryIndex indexer = new MemoryIndex(); [javac] ^ [javac] symbol: class MemoryIndex [javac] location: class org.apache.lucene.search.highlight.WeightedSpanTermExtractor [javac] D:\Repository\SerachAndMatch\Lucene\lucene\java\trunk\contrib\highlighter\src\java\org\apache\lucene\search\highlight\WeightedSpanTermExtractor.java:306: cannot find symbol [javac] MemoryIndex indexer = new MemoryIndex(); [javac] ^ [javac] symbol: class MemoryIndex [javac] location: class org.apache.lucene.search.highlight.WeightedSpanTermExtractor [javac] Note: Some input files use unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. [javac] 3 errors add one setter for start and end offset to OffsetAttribute -- Key: LUCENE-1616 URL: https://issues.apache.org/jira/browse/LUCENE-1616 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Eks Dev Priority: Trivial Fix For: 2.9 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch add OffsetAttribute. setOffset(startOffset, endOffset); trivial change, no JUnit needed Changed CharTokenizer to use it -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
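For reference, the shape of the change is tiny. A minimal mimic of the attribute (not the real org.apache.lucene.analysis.tokenattributes.OffsetAttribute) shows the single combined call a CharTokenizer-style tokenizer would make per emitted token, replacing the two separate setters:

```java
// Minimal stand-in for the OffsetAttribute change in this patch:
// one setOffset(start, end) call instead of setStartOffset(start)
// followed by setEndOffset(end).
public class OffsetAttr {
    private int startOffset;
    private int endOffset;

    // the combined setter the patch adds
    public void setOffset(int start, int end) {
        this.startOffset = start;
        this.endOffset = end;
    }

    public int startOffset() { return startOffset; }
    public int endOffset() { return endOffset; }

    public static void main(String[] args) {
        OffsetAttr att = new OffsetAttr();
        // tokenizer-style use: one call per token, e.g. for a token
        // spanning chars [5, 11) of the input
        att.setOffset(5, 11);
        System.out.println(att.startOffset() + "-" + att.endOffset()); // 5-11
    }
}
```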
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703329#action_12703329 ] Michael McCandless commented on LUCENE-1593: bq. I thought we agreed that initializing to Float.NEG_INF is reasonable for TSDC? Woops, sorry, you're right. I'm just losing my mind. I think the javadoc for PriorityQueue.addSentinelObjects should state that the Objects must all be logically equal? Ie we do a straight copy into the pqueue, so if they are not equal then the pqueue is in a messed up state. Actually that method is somewhat awkward. I wonder if, instead, we could define an Object getSentinelObject(), returning null by default, and the pqueue on creation would call that and if it's non-null, fill the queue (by calling it maxSize times)? bq. Could be useful - but then we should probably do it on DocIdSetIterator with default impl, and override where it makes sense (BS and BS2)? Also, if we do this, why not add an end() too, allowing a DISI to release resources? Actually shouldn't Weight.scorer(...) in general be the place where such pre-next() initialization is done? EG BooleanWeight.scorer(...) should call BS2's initCountingSumScorer (and/or somehow forward to BS)? bq. Yes I kept BS and BS2 in mind ... if we conditionalize anything, it means an extra 'if'. If we want to avoid that 'if', we need to create a variant of the class, which might not be so bad in TSDC, but will look awful in TFC (additional 6 classes). Yeah, that (the * 2 splintering) is what I was fearing. At some point we should leave this splintering to source code specialization...it's getting somewhat crazy now. bq. Perhaps we should still attempt to add to PQ if cmp == 0? That basically undoes the 'don't fall back to docID' optimization, right? bq. Or did you have something else in mind? The 6 new classes is what I feared we'd need to do. 
Else, with the changes here (that never break ties by docID), TopFieldCollector can't be used with BooleanScorer (which breaks back compat). I guess since the 6 classes are hidden under the TopFieldCollector.create it's maybe not so bad? Optimizations to TopScoreDocCollector and TopFieldCollector --- Key: LUCENE-1593 URL: https://issues.apache.org/jira/browse/LUCENE-1593 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 Attachments: LUCENE-1593.patch, PerfTest.java This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is: # Ensure that IndexSearcher returns segments in increasing doc Id order, instead of numDocs(). # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete. # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the check if reusableSD == null. # Also move to use changing top and then call adjustTop(), in case we update the queue. # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker for the last SortField. But, doing so should not be necessary (since we already break ties by docID), and is in fact less efficient (once the above optimization is in). # Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add an addDummyObjects method which will populate the queue without arranging it, just store the objects in the array (this can be used to pre-populate sentinel values)? I will post a patch as well as some perf measurements as soon as I have them. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
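The getSentinelObject() idea from the comment above can be sketched as a self-contained toy queue (not Lucene's actual PriorityQueue; the linear scan standing in for a real heap is just for brevity): because all sentinels are logically equal, a straight fill keeps the structure consistent, and callers can then always use insertWithOverflow() with no null or size checks.

```java
import java.util.Arrays;

// Toy sketch of a sentinel-pre-filled bounded queue of top scores.
// On construction it fills itself with maxSize logically-equal
// sentinels (NEG_INF), so every later insert follows one code path.
public class SentinelQueue {
    private final float[] heap; // linear scan stands in for a real heap

    public SentinelQueue(int maxSize) {
        heap = new float[maxSize];
        // all sentinels are equal, so a straight copy-in is safe --
        // exactly the precondition the javadoc should state
        Arrays.fill(heap, Float.NEGATIVE_INFINITY);
    }

    // always callable: returns the evicted (smallest) value, which is
    // a sentinel until the queue has filled with real scores
    public float insertWithOverflow(float score) {
        int min = 0;
        for (int i = 1; i < heap.length; i++) {
            if (heap[i] < heap[min]) min = i;
        }
        if (score <= heap[min]) return score; // didn't make the top-N
        float evicted = heap[min];
        heap[min] = score;
        return evicted;
    }

    // smallest retained score (the "top" of a min-heap)
    public float top() {
        float min = heap[0];
        for (float v : heap) if (v < min) min = v;
        return min;
    }

    public static void main(String[] args) {
        SentinelQueue pq = new SentinelQueue(2);
        pq.insertWithOverflow(0.5f);
        pq.insertWithOverflow(1.5f);
        pq.insertWithOverflow(0.1f); // below the current top-2, bounces off
        System.out.println(pq.top()); // 0.5
    }
}
```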
[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703332#action_12703332 ] Mark Miller commented on LUCENE-1616: - bq. The only thing that fails is contrib, but I guess this has nothing to do with it? Looks like an issue with the highlighter's dependency on MemoryIndex. What target produces the problem? We have seen something like it in the past. add one setter for start and end offset to OffsetAttribute -- Key: LUCENE-1616 URL: https://issues.apache.org/jira/browse/LUCENE-1616 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Eks Dev Priority: Trivial Fix For: 2.9 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch add OffsetAttribute. setOffset(startOffset, endOffset); trivial change, no JUnit needed Changed CharTokenizer to use it -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1613) TermEnum.docFreq() is not updated with there are deletes
[ https://issues.apache.org/jira/browse/LUCENE-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1270#action_1270 ] Matt Chaput edited comment on LUCENE-1613 at 4/27/09 1:11 PM: -- Given how fundamental the issue is w.r.t. how Lucene stores the index, it's unlikely to ever be fixed. (A clean, performant fix other than simply merging the segments would be a pretty incredible revelation.) As an outside observer I would argue against keeping the bug open forever for correctness sake. was (Author: mchaput): Given how fundamental the issue is w.r.t. how Lucene stores the index, it's unlikely to ever be fixed. (A clean, performant fix other than simply merging the segments would be pretty incredible revelation.) As an outside observer I would argue against keeping the bug open forever for correctness sake. TermEnum.docFreq() is not updated with there are deletes Key: LUCENE-1613 URL: https://issues.apache.org/jira/browse/LUCENE-1613 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.4 Reporter: John Wang Attachments: TestDeleteAndDocFreq.java TermEnum.docFreq is used in many places, especially scoring. However, if there are deletes in the index and it is not yet merged, this value is not updated. Attached is a test case. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1613) TermEnum.docFreq() is not updated with there are deletes
[ https://issues.apache.org/jira/browse/LUCENE-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1270#action_1270 ] Matt Chaput commented on LUCENE-1613: - Given how fundamental the issue is w.r.t. how Lucene stores the index, it's unlikely to ever be fixed. (A clean, performant fix other than simply merging the segments would be pretty incredible revelation.) As an outside observer I would argue against keeping the bug open forever for correctness sake. TermEnum.docFreq() is not updated with there are deletes Key: LUCENE-1613 URL: https://issues.apache.org/jira/browse/LUCENE-1613 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.4 Reporter: John Wang Attachments: TestDeleteAndDocFreq.java TermEnum.docFreq is used in many places, especially scoring. However, if there are deletes in the index and it is not yet merged, this value is not updated. Attached is a test case. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703337#action_12703337 ] Michael McCandless commented on LUCENE-1616: bq. I use patched versions of SpanNear/OrQueries, which still use backing ArrayList, but accept it in constructor, have utility 'add' method and getClauses() returns this very list, which allows for zero-cost inspection and easy modification if the need arises. That sounds useful -- is it something you can share? add one setter for start and end offset to OffsetAttribute -- Key: LUCENE-1616 URL: https://issues.apache.org/jira/browse/LUCENE-1616 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Eks Dev Priority: Trivial Fix For: 2.9 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch add OffsetAttribute. setOffset(startOffset, endOffset); trivial change, no JUnit needed Changed CharTokenizer to use it -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703335#action_12703335 ] Eks Dev commented on LUCENE-1616: - ant build-contrib add one setter for start and end offset to OffsetAttribute -- Key: LUCENE-1616 URL: https://issues.apache.org/jira/browse/LUCENE-1616 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Eks Dev Priority: Trivial Fix For: 2.9 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch add OffsetAttribute. setOffset(startOffset, endOffset); trivial change, no JUnit needed Changed CharTokenizer to use it -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1613) TermEnum.docFreq() is not updated with there are deletes
[ https://issues.apache.org/jira/browse/LUCENE-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703338#action_12703338 ] Mark Miller commented on LUCENE-1613: - This is a dupe I believe, but for the life of me, I cannot find the original to link them. TermEnum.docFreq() is not updated with there are deletes Key: LUCENE-1613 URL: https://issues.apache.org/jira/browse/LUCENE-1613 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.4 Reporter: John Wang Attachments: TestDeleteAndDocFreq.java TermEnum.docFreq is used in many places, especially scoring. However, if there are deletes in the index and it is not yet merged, this value is not updated. Attached is a test case. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory
Allow setting the IndexWriter docstore to be a different directory -- Key: LUCENE-1618 URL: https://issues.apache.org/jira/browse/LUCENE-1618 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Add an IndexWriter.setDocStoreDirectory method that allows doc stores to be placed in a different directory than the IW default dir. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-1616: -- Assignee: Michael McCandless add one setter for start and end offset to OffsetAttribute -- Key: LUCENE-1616 URL: https://issues.apache.org/jira/browse/LUCENE-1616 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Eks Dev Assignee: Michael McCandless Priority: Trivial Fix For: 2.9 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch add OffsetAttribute. setOffset(startOffset, endOffset); trivial change, no JUnit needed Changed CharTokenizer to use it -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1619) TermAttribute.termLength() optimization
[ https://issues.apache.org/jira/browse/LUCENE-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eks Dev updated LUCENE-1619: Attachment: LUCENE-1619.patch TermAttribute.termLength() optimization --- Key: LUCENE-1619 URL: https://issues.apache.org/jira/browse/LUCENE-1619 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Eks Dev Priority: Trivial Attachments: LUCENE-1619.patch public int termLength() { initTermBuffer(); // This patch removes this method call return termLength; } I see no reason to initTermBuffer() in termLength()... all tests pass, but I could be wrong? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1617) Add testpackage to common-build.xml
[ https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703341#action_12703341 ] Shai Erera commented on LUCENE-1617: Ok, so I've done some research, and I'm really puzzled. Everywhere I read, it is mentioned that batchtest uses a fileset to include test cases, and that you should include them using a pattern like **/Test*.java ... which is what is done already if none of the special test modes is specified (a single test, a package or package-root). However, for some reason if the definition looks like this, those non-TestCase classes are filtered out / skipped: {code} <fileset dir="src/test" includes="**/Test*.java,**/*Test.java" /> {code} But if the definition looks like this, they are executed, which results in a failure: {code} <fileset dir="src/test" includes="**/lucene/Test*.java,**/lucene/*Test.java" /> {code} As if the batchtest task behaves differently when the definition of includes contains a different pattern than the first one. I also tried to modify the dir attribute, to define src/test/org/apache/lucene, but that doesn't seem to solve the problem. So the only thing I can think of is to rename those classes to not start/end with Test? I'd hate to lose the ability to test an entire package, just because of that limitation. By running ant test-core -Dtestpackage=lucene I can discover all the non-test classes that start/end with Test. What do you think? Add testpackage to common-build.xml - Key: LUCENE-1617 URL: https://issues.apache.org/jira/browse/LUCENE-1617 Project: Lucene - Java Issue Type: Improvement Components: Build Reporter: Shai Erera Priority: Minor Fix For: 2.9 Attachments: LUCENE-1617.patch, LUCENE-1617.patch One can define testcase to execute just one test class, which is convenient. However, I didn't notice any equivalent for testing a whole package. 
I find it convenient to be able to test packages rather than test cases because often it is not so clear which test class to run. The following patch allows one to ant test -Dtestpackage=search (for example) and run all tests under the \*/search/\* packages in core, contrib and tags, or do ant test-core -Dtestpackage=search and execute similarly just for core, or do ant test-core -Dtestpackage=lucene/search/function and run all the tests under \*/lucene/search/function/\* (just in case there is another o.a.l.something.search.function package out there which we want to exclude). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1619) TermAttribute.termLength() optimization
TermAttribute.termLength() optimization --- Key: LUCENE-1619 URL: https://issues.apache.org/jira/browse/LUCENE-1619 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Eks Dev Priority: Trivial Attachments: LUCENE-1619.patch public int termLength() { initTermBuffer(); // This patch removes this method call return termLength; } I see no reason to call initTermBuffer() in termLength()... all tests pass, but I could be wrong? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
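A self-contained mimic of the attribute (not the real TermAttribute implementation) shows why the call is removable: termLength is a plain field maintained independently of the lazily built char buffer, so reading the length never needs the buffer to exist.

```java
// Stand-in sketch of the LUCENE-1619 change. initTermBuffer() lazily
// converts a String-backed term into a char[]; termLength() only reads
// an int field, so the patch drops the call from it.
public class TermAttr {
    private final String termText = "lucene"; // illustrative value
    private char[] termBuffer;                // lazily materialized
    private int termLength = termText.length();

    private void initTermBuffer() {
        if (termBuffer == null) {
            termBuffer = termText.toCharArray(); // lazy materialization
        }
    }

    public char[] termBuffer() {
        initTermBuffer(); // still required: callers read the array itself
        return termBuffer;
    }

    public int termLength() {
        // the patch removes initTermBuffer() here: the length field is
        // kept up to date whether or not the buffer exists yet
        return termLength;
    }

    public static void main(String[] args) {
        TermAttr att = new TermAttr();
        System.out.println(att.termLength()); // 6, no buffer allocation
    }
}
```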
[jira] Commented: (LUCENE-1617) Add testpackage to common-build.xml
[ https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703346#action_12703346 ] Michael McCandless commented on LUCENE-1617: bq. So the only thing I can think of is to rename those classes to not start/end with Test? I think this is an OK workaround for the ant spookiness? (We could also ask our resident ant expert to figure it out ;) ) I think these classes are quite old and probably never used by anyone anymore. Add testpackage to common-build.xml - Key: LUCENE-1617 URL: https://issues.apache.org/jira/browse/LUCENE-1617 Project: Lucene - Java Issue Type: Improvement Components: Build Reporter: Shai Erera Priority: Minor Fix For: 2.9 Attachments: LUCENE-1617.patch, LUCENE-1617.patch One can define testcase to execute just one test class, which is convenient. However, I didn't notice any equivalent for testing a whole package. I find it convenient to be able to test packages rather than test cases because often it is not so clear which test class to run. The following patch allows one to ant test -Dtestpackage=search (for example) and run all tests under the \*/search/\* packages in core, contrib and tags, or do ant test-core -Dtestpackage=search and execute similarly just for core, or do ant test-core -Dtestpackage=lucene/search/function and run all the tests under \*/lucene/search/function/\* (just in case there is another o.a.l.something.search.function package out there which we want to exclude). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1617) Add testpackage to common-build.xml
[ https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703348#action_12703348 ] Shai Erera commented on LUCENE-1617: bq. I think these classes are quite old and probably never used by anyone anymore. Then perhaps I just delete them? :D If that's not acceptable, I'll run all the tests in core and contrib and rename those that fail. But deleting them really tickles the tips of my fingers! Add testpackage to common-build.xml - Key: LUCENE-1617 URL: https://issues.apache.org/jira/browse/LUCENE-1617 Project: Lucene - Java Issue Type: Improvement Components: Build Reporter: Shai Erera Priority: Minor Fix For: 2.9 Attachments: LUCENE-1617.patch, LUCENE-1617.patch One can define testcase to execute just one test class, which is convenient. However, I didn't notice any equivalent for testing a whole package. I find it convenient to be able to test packages rather than test cases because often it is not so clear which test class to run. The following patch allows one to ant test -Dtestpackage=search (for example) and run all tests under the \*/search/\* packages in core, contrib and tags, or do ant test-core -Dtestpackage=search and execute similarly just for core, or do ant test-core -Dtestpackage=lucene/search/function and run all the tests under \*/lucene/search/function/\* (just in case there is another o.a.l.something.search.function package out there which we want to exclude). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1617) Add testpackage to common-build.xml
[ https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703356#action_12703356 ] Michael McCandless commented on LUCENE-1617: Actually I think deleting them is a good idea! Does anyone object? Add testpackage to common-build.xml - Key: LUCENE-1617 URL: https://issues.apache.org/jira/browse/LUCENE-1617 Project: Lucene - Java Issue Type: Improvement Components: Build Reporter: Shai Erera Priority: Minor Fix For: 2.9 Attachments: LUCENE-1617.patch, LUCENE-1617.patch One can define testcase to execute just one test class, which is convenient. However, I didn't notice any equivalent for testing a whole package. I find it convenient to be able to test packages rather than test cases because often it is not so clear which test class to run. The following patch allows one to ant test -Dtestpackage=search (for example) and run all tests under the \*/search/\* packages in core, contrib and tags, or do ant test-core -Dtestpackage=search and execute similarly just for core, or do ant test-core -Dtestpackage=lucene/search/function and run all the tests under \*/lucene/search/function/\* (just in case there is another o.a.l.something.search.function package out there which we want to exclude). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: new TokenStream api Question
Should I create a patch with something like this? With Expert javadoc and an explanation of what this is good for, it should be a nice addition for Attribute use cases. Practically, it would enable specialization of hard-linked Attributes like TermAttribute. The only preconditions are: - the specialized Attribute must extend one of the hard-linked ones, and provide its class - it must implement a default constructor - it should extend without introducing state (the big majority of cases), so as not to break captureState() The last one could be relaxed, I guess, but I am not yet 100% familiar with this code. Use cases for this are along the lines of my example: smaller, easier user code and better performance (token filters mainly) - Original Message From: Uwe Schindler u...@thetaphi.de To: java-dev@lucene.apache.org Sent: Sunday, 26 April, 2009 23:03:06 Subject: RE: new TokenStream api Question There is one problem: if you extend TermAttribute, the class is different (which is the key in the attributes list). So when you initialize the TokenStream and do a YourClass termAtt = (YourClass) addAttribute(YourClass.class) ...you create a new attribute. So one possibility would be to also specify the instance and save the attribute by class (as key), but with your instance. If you are the first one that creates the attribute (if it is a token stream and not a filter it is OK, you will be the first, as it adds the attribute in the ctor), everything is fine. Register the attribute by yourself (maybe we should add a specialized addAttribute that can specify an instance as default)?: YourClass termAtt = new YourClass(); attributes.put(TermAttribute.class, termAtt); In this case, for the indexer it is a standard TermAttribute, but you can do more with it. Replacing TermAttribute with your own class is not possible, as the indexer will get a ClassCastException when using the instance retrieved with getAttribute(TermAttribute.class). 
Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: eks dev [mailto:eks...@yahoo.co.uk] Sent: Sunday, April 26, 2009 10:39 PM To: java-dev@lucene.apache.org Subject: new TokenStream api Question I am just looking into the new TermAttribute usage and wonder what would be the best way to implement a PrefixFilter that would filter out some Terms that have some prefix, something like this, where '-' represents my prefix: public final boolean incrementToken() throws IOException { // the first word we found while (input.incrementToken()) { int len = termAtt.termLength(); if (len > 0 && termAtt.termBuffer()[0] != '-') // only length > 0 and no '-' prefix return true; // note: else we ignore it } // reached EOS return false; } The question would be: can I extend TermAttribute and add boolean startsWith(char c); The point is speed, and my code gets smaller. TermAttribute has one method call in termLength() and termBuffer() I do not understand (back compatibility, I guess): public int termLength() { initTermBuffer(); // I'd like to avoid it... return termLength; } I'd like to get rid of initTermBuffer(); the first option is to *extend* the TermAttribute code (but its fields are private, so no help there), or can I implement my own MyTermAttribute (will the Indexer know how to deal with it?) Must I extend TermAttribute, or can I add my own? thanks, eks - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
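Uwe's suggestion above (register your subclass instance under the TermAttribute.class key, so the indexer and your filter share one instance) can be sketched without Lucene at all. The names below (AttributeDemo, TermAttr, MyTermAttr) are hypothetical stand-ins for AttributeSource, TermAttribute, and eks dev's extension; the Map plays the role of the class-keyed attribute list:

```java
import java.util.HashMap;
import java.util.Map;

// Lucene-free sketch: a specialized subclass instance is registered under the
// base attribute's class key, so a consumer asking for the base class gets
// the very same instance the producer specialized.
public class AttributeDemo {
    static class TermAttr {               // stand-in for TermAttribute
        char[] buffer = new char[0];
        int length;
    }
    static class MyTermAttr extends TermAttr {
        // the convenience method from the question; adds no state
        boolean startsWith(char c) { return length > 0 && buffer[0] == c; }
    }

    private final Map<Class<?>, TermAttr> attributes = new HashMap<>();

    // analogue of a hypothetical addAttribute(Class, instance) overload
    <T extends TermAttr> T addAttribute(Class<? super T> key, T instance) {
        attributes.put(key, instance);
        return instance;
    }

    TermAttr getAttribute(Class<? extends TermAttr> key) {
        return attributes.get(key);
    }

    public static void main(String[] args) {
        AttributeDemo src = new AttributeDemo();
        // producer registers its specialization under the base key
        MyTermAttr mine = src.addAttribute(TermAttr.class, new MyTermAttr());
        mine.buffer = new char[] {'-', 'a'};
        mine.length = 2;
        // the "indexer" sees a plain TermAttr, but it is the same object
        TermAttr plain = src.getAttribute(TermAttr.class);
        System.out.println(plain == mine);
        System.out.println(mine.startsWith('-'));
    }
}
```

This only works if the producer registers the instance before any consumer calls getAttribute, which matches Uwe's caveat that the stream's ctor must be the first to add the attribute.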
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703362#action_12703362 ] Shai Erera commented on LUCENE-1593: bq. I wonder if, instead, we could define an Object getSentinelObject(), returning null by default, and the pqueue on creation would call that and if it's non-null, fill the queue (by calling it maxSize times)? Some extensions of PQ may not know how to construct such a sentinel object. Consider ComparablePQ, which assumes all items are Comparable. Unlike HitQueue, it does not know what the items stored in the queue will be. But ... I guess it can return a Comparable which always prefers the other element ... so maybe that's not a good example. I just have the feeling that a setter method will give us more freedom, rather than having to extend PQ just for that ... bq. Actually shouldn't Weight.scorer(...) in general be the place where such pre-next() initialization is done? Ok - so BS's add() is only called from BS2.score(Collector). Therefore BS can be initialized from BS2 directly. Since both are package-private, we should be safe. BS2's add() is called from BooleanWeight.scorer() (I'm sorry if I repeat what you wrote above, but that's just me learning the code), and so can be initialized from there ... hmm, I wonder why this wasn't done so far? I'll give it a try. bq. That basically undoes the don't fallback to docID optimization right? Right ... it's too late for me :) bq. I guess since the 6 classes are hidden under the TopFieldCollector.create it's maybe not so bad? It's just that maintaining that class becomes more and more problematic. It already contains 6 inner classes, which duplicate the code to avoid 'if' statements. Meaning every time a bug is found, all 6 need to be checked and fixed. With that proposal, it means 12 ... But I wonder from where we would control it ... 
IndexSearcher no longer has a ctor which allows one to define whether docs should be collected in order or not (the patch removes it). The only other place where it's defined is in BooleanQuery's static setter (affects all boolean queries). But BooleanWeight receives the Collector, and does not create it ... So, if we check in IndexSearcher's search() methods whether this parameter is set or not, we can control the creation of TSDC and TFC. And if anyone else instantiates them on his own, he should know whether he executes searches in-order or not. Back-compat-wise, TFC and TSDC are still in trunk and haven't been released, so we shouldn't have a problem, right? Does that sound like a good approach? I still hate to duplicate the code in TFC, but I don't think there's any other choice. Maybe create completely separate classes for TFC and TSDC? Although that will make code maintenance even harder. Optimizations to TopScoreDocCollector and TopFieldCollector --- Key: LUCENE-1593 URL: https://issues.apache.org/jira/browse/LUCENE-1593 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 Attachments: LUCENE-1593.patch, PerfTest.java This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is: # Ensure that IndexSearcher returns segments in increasing doc Id order, instead of numDocs(). # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete. # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the check if reusableSD == null. # Also move to use changing top and then call adjustTop(), in case we update the queue. # Some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker for the last SortField. But doing so should not be necessary (since we already break ties by docID), and is in fact less efficient (once the above optimization is in). 
# Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add an addDummyObjects method which will populate the queue without arranging it, just storing the objects in the array (this can be used to pre-populate sentinel values)? I will post a patch as well as some perf measurements as soon as I have them.
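Plan item 3 above (pre-populating HitQueue with Float.NEG_INF sentinels) can be illustrated with a minimal, Lucene-free fixed-size min-heap. Because every slot starts at negative infinity, the hot collect loop needs no "is the queue full yet" branch and no null check on a reusable ScoreDoc: every incoming score simply competes against the top. SentinelQueueDemo and its method names are hypothetical, not Lucene's PriorityQueue API:

```java
import java.util.Arrays;

// Lucene-free sketch of sentinel pre-population: a fixed-size min-heap of
// scores where heap[0] is always the smallest ("top") element.
public class SentinelQueueDemo {
    final float[] heap;

    SentinelQueueDemo(int size) {
        heap = new float[size];
        // sentinels: every real score beats NEGATIVE_INFINITY, so the queue
        // behaves as if it were already full
        Arrays.fill(heap, Float.NEGATIVE_INFINITY);
    }

    // analogue of insertWithOverflow + adjustTop: no size or null checks
    void insertWithOverflow(float score) {
        if (score > heap[0]) {
            heap[0] = score;   // overwrite top in place ...
            downHeap();        // ... then restore the heap property
        }
    }

    private void downHeap() {
        int i = 0;
        while (true) {
            int l = 2 * i + 1, r = l + 1, smallest = i;
            if (l < heap.length && heap[l] < heap[smallest]) smallest = l;
            if (r < heap.length && heap[r] < heap[smallest]) smallest = r;
            if (smallest == i) break;
            float t = heap[i]; heap[i] = heap[smallest]; heap[smallest] = t;
            i = smallest;
        }
    }

    float top() { return heap[0]; }
}
```

A caller collecting the top 3 of the scores 5, 1, 7, 3 would end with 3 as the queue's top; the 1 is silently displaced without any branch asking whether the queue was full.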
[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory
[ https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703363#action_12703363 ] Yonik Seeley commented on LUCENE-1618: -- I can see how this would potentially be useful for realtime... but it seems like only IndexWriter could eventually fix the situation of having the docstore on disk and the rest of a segment in RAM. Which means that this API shouldn't be public? Allow setting the IndexWriter docstore to be a different directory -- Key: LUCENE-1618 URL: https://issues.apache.org/jira/browse/LUCENE-1618 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Original Estimate: 336h Remaining Estimate: 336h Add an IndexWriter.setDocStoreDirectory method that allows doc stores to be placed in a different directory than the IW default dir.
[jira] Commented: (LUCENE-1617) Add testpackage to common-build.xml
[ https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703365#action_12703365 ] Shai Erera commented on LUCENE-1617: As reference, I ran test-core and test-contrib and these are the problematic classes (from core only): * Test org.apache.lucene.AnalysisTest * Test org.apache.lucene.IndexTest * Test org.apache.lucene.SearchTest * Test org.apache.lucene.StoreTest * Test org.apache.lucene.ThreadSafetyTest
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703366#action_12703366 ] Michael McCandless commented on LUCENE-1313: {quote} Agreed that DW can write the segment to the RAMDir. I started coding along these lines, however what do we do about the RAMDir merging? This is why I was thinking we'll need a separate IW? Otherwise the ram segments (if they are treated the same as disk segments) would quickly be merged to disk? Or we have two separate merging paths? {quote} Hmm, right. We could exclude RAMDir segments from consideration by MergePolicy? Alternatively, we could expect the MergePolicy to recognize this and be smart about choosing merges (ie don't mix merges)? EG we do in fact want some merging of the RAM segments if they get too numerous (since that will impact search performance). {quote} we should make with NRT is to not close the doc store (stored fields, term vector) files when flushing for an NRT reader. Agreed, I think this feature is a must, otherwise we're doing unnecessary in-RAM merging. {quote} OK, let's do this as a separate issue/optimization for NRT. There are two separate parts to it: * Ability to store doc stores in a real directory (looks like you opened LUCENE-1618 for this part). * Ability to share IndexOutput / IndexInput {quote} I ran into problems with this before, I was trying to reuse Directory to write a transaction log. It seemed theoretically doable, however it didn't work in practice. It could have been the seeking and replacing, but I don't remember. FSIndexOutput uses a writeable RAF and FSIndexInput is read only - why would there be an issue? {quote} Hmm... seems like we need to investigate further. We could either ask an IndexOutput for its IndexInput (sharing the underlying RAF), or try to separately open an IndexInput (which may not work on Windows). 
{quote} To implement this functionality in parallel (and perhaps make the overall patch cleaner), writing doc stores directly to a separate directory can be a different patch? There can be an option IW.setDocStoresDirectory(Directory) that the patch implements? Then some unit tests that are separate from the near realtime portion. {quote} Yes, separate issue (LUCENE-1618). Realtime Search --- Key: LUCENE-1313 URL: https://issues.apache.org/jira/browse/LUCENE-1313 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch Realtime search with transactional semantics. Possible future directions: * Optimistic concurrency * Replication Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot. I think this issue can hold realtime benchmarks which include indexing and searching concurrently.
[jira] Assigned: (LUCENE-1597) New Document and Field API
[ https://issues.apache.org/jira/browse/LUCENE-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch reassigned LUCENE-1597: - Assignee: Michael Busch New Document and Field API -- Key: LUCENE-1597 URL: https://issues.apache.org/jira/browse/LUCENE-1597 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Attachments: lucene-new-doc-api.patch This is a super rough prototype of how a new document API could look like. It's basically what I came up with during a long flight across the Atlantic :) It is not integrated with anything yet (like IndexWriter, DocumentsWriter, etc.) and heavily uses Java 1.5 features, such as generics and annotations. The general idea sounds similar to what Marvin is doing in KS, which I found out by reading Mike's comments on LUCENE-831, I haven't looked at the KS API myself yet. Main ideas: - separate a field's value from its configuration; therefore this patch introduces two classes: FieldDescriptor and FieldValue - I was thinking that in most cases the documents people add to a Lucene index look alike, i.e. they contain mostly the same fields with the same settings. Yet, for every field instance the DocumentsWriter checks the settings and calls the right consumers, which themselves check settings and return true or false, indicating whether or not they want to do something with that field or not. So I was thinking we could design the document API similar to the Class-Object concept of OO-languages. There a class is a blueprint (as everyone knows :) ), and an object is one instance of it. So in this patch I introduced a class called DocumentDescriptor, which contains all FieldDescriptors with the field settings. This descriptor is given to the consumer (IndexWriter) once in the constructor. Then the Document instances are created and added via addDocument(). 
- A Document instance allows adding variable fields in addition to the fixed fields the DocumentDescriptor contains. For these fields the consumers have to check the field settings for every document instance (like with the old document API). This is for maintaining Lucene's flexibility that everyone loves. - Disregard the changes to AttributeSource for now. The code that's worth looking at is contained in a new package newdoc. Again, this is not a real patch, but rather a demo of how a new API could roughly work.
[jira] Commented: (LUCENE-1597) New Document and Field API
[ https://issues.apache.org/jira/browse/LUCENE-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703367#action_12703367 ] Michael Busch commented on LUCENE-1597: --- Thanks for the thorough review, Mike. Reading your response made me really excited, because you exactly understood most of the thoughts I put into this code, without me even mentioning them :) Thanks for writing them down! I started including your suggestions into my patch and will reply with more detail to your individual points as I'm working on them.
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703368#action_12703368 ] Earwin Burrfoot commented on LUCENE-1593: - Use FMPP? It is pretty nice and integrates well into maven/ant builds. I'm using it for primitive-specialized fieldcaches.
[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory
[ https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703370#action_12703370 ] Michael McCandless commented on LUCENE-1618: Yeah, I also think this should be an under-the-hood (done only by NRT) optimization inside IndexWriter. The only possible non-NRT case I can think of is when users make temporary indices in RAM; it's possible one would want to write the docStore files to an FSDirectory (because they are so large) but keep postings, norms, deletes, etc. in RAM. But going down that road opens up a can of worms... eg does segments_N somehow have to keep track of which dir has which parts of a segment? Suddenly IndexReader must also know to look in different dirs for different parts of a segment, etc. It might be cleaner to make a Directory impl that dispatches certain files to a RAMDir and others to an FSDir, so IndexWriter/IndexReader still see a single Directory API.
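Mike's "dispatching Directory" idea - one facade that routes doc-store files to one backing store and everything else to another, so IndexWriter/IndexReader still see a single Directory - hinges on a per-file routing predicate. The sketch below is a hypothetical, Lucene-free illustration: Maps stand in for RAMDirectory/FSDirectory, and byte arrays stand in for IndexOutput/IndexInput. The doc-store extension list (fdt, fdx for stored fields; tvx, tvd, tvf for term vectors) matches the 2.9-era file formats:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Lucene-free sketch of a dispatching Directory: one facade, two backing
// stores, with routing decided purely by file name.
public class FileDispatchDemo {
    // extensions of doc-store files (stored fields + term vectors)
    static final Set<String> DOC_STORE_EXTS =
            Set.of("fdt", "fdx", "tvx", "tvd", "tvf");

    final Map<String, byte[]> primary = new HashMap<>();   // e.g. a RAMDir
    final Map<String, byte[]> docStore = new HashMap<>();  // e.g. an FSDir

    // the routing predicate - the part that actually matters
    static boolean isDocStoreFile(String name) {
        int dot = name.lastIndexOf('.');
        return dot >= 0 && DOC_STORE_EXTS.contains(name.substring(dot + 1));
    }

    // analogue of Directory.createOutput: route the write by file name
    void createOutput(String name, byte[] data) {
        (isDocStoreFile(name) ? docStore : primary).put(name, data);
    }

    // analogue of Directory.openInput: the same predicate finds the file
    byte[] openInput(String name) {
        return (isDocStoreFile(name) ? docStore : primary).get(name);
    }
}
```

Because both reads and writes go through the same predicate, neither IndexWriter nor IndexReader would need to know that a segment's files live in two places - which is exactly why this avoids the segments_N bookkeeping worry raised above.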
[jira] Updated: (LUCENE-1617) Add testpackage to common-build.xml
[ https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1617: --- Attachment: LUCENE-1617.patch This one removes the aforementioned test classes (that are not really tests), in case everybody agrees.
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703373#action_12703373 ] Shai Erera commented on LUCENE-1593: Forgive my ignorance, but what is FMPP? And to which of the above is it related? :)
[jira] Commented: (LUCENE-1616) add one setter for start and end offset to OffsetAttribute
[ https://issues.apache.org/jira/browse/LUCENE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703374#action_12703374 ] Michael McCandless commented on LUCENE-1616: OK, all tests pass. I had to fix a few back-compat tests (that were using the new TokenStream API, I think because we created the back-compat branch from trunk after the new TokenStream API landed). I'll commit in a day or two. Thanks Eks! add one setter for start and end offset to OffsetAttribute -- Key: LUCENE-1616 URL: https://issues.apache.org/jira/browse/LUCENE-1616 Project: Lucene - Java Issue Type: Improvement Components: Analysis Reporter: Eks Dev Assignee: Michael McCandless Priority: Trivial Fix For: 2.9 Attachments: LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch, LUCENE-1616.patch add OffsetAttribute.setOffset(startOffset, endOffset); trivial change, no JUnit needed Changed CharTokenizer to use it
[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory
[ https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703375#action_12703375 ] Jason Rutherglen commented on LUCENE-1618: -- {quote} non-NRT case I can think of is when users make temporary indices in RAM {quote} Yes, and there could be others we don't know about. {quote} it might be cleaner to make a Directory impl that dispatches certain files to a RAMDir and others to an FSDir {quote} Good idea. I'll try that method first. If this one works out, then the API will be public?
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703377#action_12703377 ] Michael McCandless commented on LUCENE-1313: So let's leave this issue focused on sometimes using RAMDir for newly created segments.
[jira] Assigned: (LUCENE-1617) Add testpackage to common-build.xml
[ https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-1617: -- Assignee: Michael McCandless Add testpackage to common-build.xml - Key: LUCENE-1617 URL: https://issues.apache.org/jira/browse/LUCENE-1617 Project: Lucene - Java Issue Type: Improvement Components: Build Reporter: Shai Erera Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1617.patch, LUCENE-1617.patch, LUCENE-1617.patch One can define testcase to execute just one test class, which is convenient. However, I didn't notice any equivalent for testing a whole package. I find it convenient to be able to test packages rather than test cases because often it is not so clear which test class to run. The following patch allows one to ant test -Dtestpackage=search (for example) and run all tests under the \*/search/\* packages in core, contrib and tags, or do ant test-core -Dtestpackage=search and execute similarly just for core, or do ant test-core -Dtestpackage=lucene/search/function and run all the tests under \*/lucene/search/function/\* (just in case there is another o.a.l.something.search.function package out there which we want to exclude). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1617) Add testpackage to common-build.xml
[ https://issues.apache.org/jira/browse/LUCENE-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703381#action_12703381 ] Michael McCandless commented on LUCENE-1617: OK looks good. I'll wait a day or two before committing. Thanks Shai!
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703382#action_12703382 ] Earwin Burrfoot commented on LUCENE-1593: - bq. Forgive my ignorance, but what is FMPP? Forgive my laziness: http://fmpp.sourceforge.net/ - What is FMPP? FMPP is a general-purpose text file preprocessor tool that uses FreeMarker templates. bq. And to which of the above is it related? To this: bq. It's just that maintaining that class becomes more and more problematic. It already contains 6 inner classes, which duplicate the code to avoid 'if' statements. Meaning every time a bug is found, all 6 need to be checked and fixed. With that proposal, it means 12 ... Mike experimented with generated code for specialized search; I see no reason not to use the same approach for cases where you're already hand-coding N almost-identical classes. You're generating the query parser after all :) For an official release FMPP is superior to Python, as it can be bundled in a cross-platform manner. Optimizations to TopScoreDocCollector and TopFieldCollector --- Key: LUCENE-1593 URL: https://issues.apache.org/jira/browse/LUCENE-1593 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 Attachments: LUCENE-1593.patch, PerfTest.java This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is: # Ensure that IndexSearcher returns segments in increasing doc Id order, instead of numDocs(). # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete. # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the check if reusableSD == null. # Also move to use changing top and then call adjustTop(), in case we update the queue. # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker for the last SortField. But, doing so should not be necessary (since we already break ties by docID), and is in fact less efficient (once the above optimization is in). # Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add an addDummyObjects method which will populate the queue without arranging it, just store the objects in the array (this can be used to pre-populate sentinel values)? I will post a patch as well as some perf measurements as soon as I have them. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
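Point 3 of the plan above (pre-populating HitQueue with sentinel values) can be sketched with a tiny, self-contained min-heap. This is an illustrative stand-in, not Lucene's actual PriorityQueue/HitQueue code; the class and method names here are hypothetical.

```java
// SentinelHeap: simplified, hypothetical stand-in for a sentinel-prefilled HitQueue.
// Pre-filling with -Infinity sentinels means the heap is always "full", so
// insertWithOverflow never needs a reusableSD == null check: a candidate either
// displaces the current weakest entry (the top) or is rejected outright.
final class SentinelHeap {
    private final float[] heap; // 1-based array holding a min-heap of scores
    private final int size;

    SentinelHeap(int size) {
        this.size = size;
        this.heap = new float[size + 1];
        java.util.Arrays.fill(heap, Float.NEGATIVE_INFINITY); // the sentinels
    }

    /** The weakest score currently in the queue. */
    float top() { return heap[1]; }

    /** Returns true if the score was competitive and entered the queue. */
    boolean insertWithOverflow(float score) {
        if (score <= heap[1]) return false; // not competitive: the common fast path
        heap[1] = score; // overwrite the weakest entry, then restore heap order
        downHeap();
        return true;
    }

    private void downHeap() {
        int i = 1;
        float node = heap[1];
        while (true) {
            int child = i << 1; // left child
            if (child > size) break;
            if (child + 1 <= size && heap[child + 1] < heap[child]) child++; // smaller child
            if (heap[child] >= node) break; // node is in place
            heap[i] = heap[child];
            i = child;
        }
        heap[i] = node;
    }
}
```

Until real hits fill all slots, top() simply returns a sentinel, so no emptiness checks are needed anywhere in the collect loop.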
[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory
[ https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703384#action_12703384 ] Tim Smith commented on LUCENE-1618: --- Would also further suggest that this Directory implementation take one or more directories to store documents, along with one or more directories to store the index itself. One of the directories should be explicitly marked for reading for each use. This allows creating a Directory instance that will: * store documents to disk (reading from disk during searches) * write the index to disk and RAM (reading from RAM during searches)
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703387#action_12703387 ] Jason Rutherglen commented on LUCENE-1313: -- {quote} We could exclude RAMDir segments from consideration by MergePolicy? Alternatively, we could expect the MergePolicy to recognize this and be smart about choosing merges (ie don't mix merges)? {quote} Is this over complicating things? Sometimes we want a mixture of RAMDir segments and FSDir segments to merge (when we've decided we have too much in ram), sometimes we don't (when we just want the ram segments to merge). I'm still a little confused as to why having a wrapper class that manages a disk writer and a ram writer isn't cleaner?
[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory
[ https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703388#action_12703388 ] Michael McCandless commented on LUCENE-1618: {quote} it might be cleaner to make a Directory impl that dispatches certain files to a RAMDir and others to an FSDir Good idea. I'll try that method first. If this one works out, then the API will be public? {quote} Which API would be public? If this (call it FileSwitchDirectory for now ;) ) works then we would not add any API to IndexWriter (ie it's either or)? But FileSwitchDirectory would be public (expert). One downside to this approach is it's brittle -- whenever we change file extensions you'd have to know to fix this Directory. Or maybe we make the Directory specialized to only storing the doc stores in the FSDir; then whenever we change file formats we would fix this directory? But in the future, with custom codecs, things could be named whatever... hmmm. Lacking clarity.
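The dispatch idea can be sketched roughly as follows. This uses a hypothetical minimal stand-in interface rather than Lucene's real Directory API, and the extension list is illustrative; note that, per the brittleness concern above and Eks Dev's later suggestion, the caller supplies the list instead of it being hard-coded.

```java
import java.util.Set;

// Hypothetical stand-in for a Directory: just named byte blobs.
interface SimpleDir {
    byte[] read(String name);
    void write(String name, byte[] data);
}

// In-memory SimpleDir used to play the role of both a RAMDir and an FSDir here.
final class MapDir implements SimpleDir {
    private final java.util.Map<String, byte[]> files = new java.util.HashMap<>();
    @Override public byte[] read(String name) { return files.get(name); }
    @Override public void write(String name, byte[] data) { files.put(name, data); }
    boolean contains(String name) { return files.containsKey(name); }
}

// Sketch of the "FileSwitchDirectory" idea: route each file to one of two
// underlying stores based on its extension.
final class FileSwitchDir implements SimpleDir {
    private final SimpleDir primary;        // e.g. an FSDir for the doc stores
    private final SimpleDir secondary;      // e.g. a RAMDir for everything else
    private final Set<String> primaryExts;  // caller-supplied extensions routed to primary

    FileSwitchDir(SimpleDir primary, SimpleDir secondary, Set<String> primaryExts) {
        this.primary = primary;
        this.secondary = secondary;
        this.primaryExts = primaryExts;
    }

    private SimpleDir dirFor(String name) {
        int dot = name.lastIndexOf('.');
        String ext = dot < 0 ? "" : name.substring(dot + 1);
        return primaryExts.contains(ext) ? primary : secondary;
    }

    @Override public byte[] read(String name) { return dirFor(name).read(name); }
    @Override public void write(String name, byte[] data) { dirFor(name).write(name, data); }
}
```

For example, constructing it with the stored-fields extensions ("fdt", "fdx" are illustrative here) would send doc stores to the primary directory and all other segment files to the secondary one.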
[jira] Commented: (LUCENE-1597) New Document and Field API
[ https://issues.apache.org/jira/browse/LUCENE-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703391#action_12703391 ] Michael Busch commented on LUCENE-1597: --- {quote} How would you turn on/off [future] CSF storage? A separate attr? A boolean on StoreAttribute? {quote} I was thinking about adding a separate attribute. But here is one thing I haven't figured out yet: it should actually be perfectly fine to store a value in a CSF and *also* in the 'normal' store. The problem is that the type of data input is the limiting factor here: if the user provides the data as a byte array, then everything works fine. However, if the data is provided as a Reader, then it's not guaranteed that the reader can be read more than once. Implementing reset() is optional, as the javadocs say. So maybe we should state in our javadocs that a reader must support reset(), otherwise writing the data into more than one data structure will result in undefined behavior? Alternatively we could introduce a new class called ResetableReader, where reset() is abstract, and change the API in 3.0 to only accept that type of reader? Btw. the same is true for fields that provide the data as a TokenStream. New Document and Field API -- Key: LUCENE-1597 URL: https://issues.apache.org/jira/browse/LUCENE-1597 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Attachments: lucene-new-doc-api.patch This is a super rough prototype of how a new document API could look like. It's basically what I came up with during a long flight across the Atlantic :) It is not integrated with anything yet (like IndexWriter, DocumentsWriter, etc.) and heavily uses Java 1.5 features, such as generics and annotations. The general idea sounds similar to what Marvin is doing in KS, which I found out by reading Mike's comments on LUCENE-831; I haven't looked at the KS API myself yet.
Main ideas: - separate a field's value from its configuration; therefore this patch introduces two classes: FieldDescriptor and FieldValue - I was thinking that in most cases the documents people add to a Lucene index look alike, i.e. they contain mostly the same fields with the same settings. Yet, for every field instance the DocumentsWriter checks the settings and calls the right consumers, which themselves check settings and return true or false, indicating whether or not they want to do something with that field or not. So I was thinking we could design the document API similar to the Class-Object concept of OO-languages. There a class is a blueprint (as everyone knows :) ), and an object is one instance of it. So in this patch I introduced a class called DocumentDescriptor, which contains all FieldDescriptors with the field settings. This descriptor is given to the consumer (IndexWriter) once in the constructor. Then the Document instances are created and added via addDocument(). - A Document instance allows adding variable fields in addition to the fixed fields the DocumentDescriptor contains. For these fields the consumers have to check the field settings for every document instance (like with the old document API). This is for maintaining Lucene's flexibility that everyone loves. - Disregard the changes to AttributeSource for now. The code that's worth looking at is contained in a new package newdoc. Again, this is not a real patch, but rather a demo of how a new API could roughly work. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
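The Reader concern raised above can be demonstrated with plain java.io. Per the java.io.Reader javadocs, reset() support is optional, so an arbitrary Reader cannot in general be consumed twice (e.g. once for a CSF and once for the normal store); StringReader does support reset() back to the beginning, so it can.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Demonstrates consuming the same Reader twice via reset(). For readers that
// don't support reset() (support is optional per the javadocs), the reset()
// call below throws IOException, which is exactly the problem discussed above.
final class ReadTwice {
    static String drain(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) sb.append((char) c);
        return sb.toString();
    }

    static String[] readTwice(Reader r) throws IOException {
        String first = drain(r);
        r.reset(); // StringReader: back to the start (or the last mark)
        String second = drain(r);
        return new String[] { first, second };
    }
}
```

A proposed ResetableReader class would essentially promise at the type level that the reset() call above never throws.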
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703396#action_12703396 ] Shai Erera commented on LUCENE-1593: bq. EG BooleanWeight.scorer(...) should call BS2's initCountingSumScorer (and/or somehow forward to BS)? Ok, that somehow forward to BS is more problematic than I initially thought. BS2.score(Collector) determines whether to instantiate a new BS, add the scorers and call bs.score(Collector), or to execute the score itself. On the other hand, it uses the same scorers in next() and skipTo(). Therefore there's kind of a mutual exclusiveness here: either the scorers are used by BS or by BS2. They cannot be used by both, unless we clone() them. If we want to clone them, we need to: * Create a BS in init(). * Clone all the Scorers and pass them to BS. * Initialize BS2's countingSumScorer. * In score(Collector) use the class member of BS. bq. hmm I wonder why this wasn't done so far? I think I understand now ... the decision on which path to take can only be determined after score(Collector) is called, or next()/skipTo(). Before that, i.e., when BW returns BS2, it does not know how it will be used, right? The decision is made by IndexSearcher.doSearch depending on whether there's a filter (next()/skipTo() are used) or not (score(Collector)). So perhaps we should revert back to having start() on DISI? Since IndexSearcher can call start before iterating over the docs, but not if it uses scorer.score(Collector), which is delegated to the scorer. In that case, we should check whether the countingSumScorer was initialized and if not initialize it ourselves. Am I missing something?
[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory
[ https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703406#action_12703406 ] Eks Dev commented on LUCENE-1618: - Maybe FileSwitchDirectory should have the possibility to take the file list/extensions that should be loaded into RAM, making it maintenance free and pushing this decision to the end user... If and when we decide to support users in it, we could then maintain a static list in a separate place. Kind of separating execution and configuration. I *think* I saw something similar Ning Lee made quite a while ago, from the hadoop camp (indexing on hadoop something...). But cannot remember what it was :(
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703404#action_12703404 ] Michael McCandless commented on LUCENE-1593: {quote} Some extensions of PQ may not know how to construct such a sentinel object. Consider ComparablePQ, which assumes all items are Comparable. Unlike HitQueue, it does not know what will be the items stored in the queue. But ... I guess it can return a Comparable which always prefers the other element ... so maybe that's not a good example. I just have the feeling that a setter method will give us more freedom, rather than having to extend PQ just for that ... {quote} Such extensions shouldn't use a sentinel? Various things spook me about the separate method: one could easily pass in bad sentinels, and then the queue is in an invalid state; the method can be called at any time (whereas the only time you should do this is on init); you could pass in a wrong-sized array; the API is necessarily public (whereas with getSentinel() it'd be protected). We can mull it over some more... sleep on it ;) bq. Right ... it's too late for me I've been starting to wonder if you are a robot... bq. hmm I wonder why this wasn't done so far? I don't know! Seems like a simple optimization. So we don't need start/end (now at least). bq. It's just that maintaining that class becomes more and more problematic. I completely agree: this is the tradeoff we have to mull. But I like that all these classes are private (it hides the fact that there are 12 concrete impls). I think I'd lean towards the 12 impls now. They are tiny classes. bq. But I wonder from where would we control it Hmm, yeah, good point. The only known place in Lucene's core that visits hits out of order is BooleanScorer. But presumably an external Query somewhere may provide a Scorer that does things out of order (maybe Solr does?), and so technically making the core collectors not break ties by docID by default is a break in back-compat. Maybe we should add a docsInOrder() method to Scorer? By default it returns false, but we fix that to return true for all core Lucene queries? And then IndexSearcher consults that to decide whether it can do this?
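The tie-breaking point above can be illustrated with two hypothetical "is this hit competitive?" checks (illustrative only, not Lucene's actual collector code): with in-order collection an equal-scoring hit can simply be rejected, because any later hit necessarily has a larger docID; with out-of-order collection an equal-scoring hit must still be compared by docID.

```java
// Why docsInOrder() matters for the collectors:
final class TieBreak {
    // In-order scorer: ties always lose, because the entry already in the
    // queue is guaranteed to have the smaller docID. No docID comparison needed.
    static boolean competitiveInOrder(float topScore, float candScore) {
        return candScore > topScore;
    }

    // Out-of-order scorer (e.g. BooleanScorer): an equal-scoring hit may
    // arrive later yet have a *smaller* docID, so it must displace the top.
    static boolean competitiveOutOfOrder(float topScore, int topDoc,
                                         float candScore, int candDoc) {
        if (candScore != topScore) return candScore > topScore;
        return candDoc < topDoc; // tie: smaller docID wins
    }
}
```

Dropping the docID comparison in the in-order case is exactly the optimization this issue proposes, and the back-compat question is whether any external Scorer silently relies on the out-of-order behavior.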
[jira] Commented: (LUCENE-1597) New Document and Field API
[ https://issues.apache.org/jira/browse/LUCENE-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703407#action_12703407 ] Michael Busch commented on LUCENE-1597: --- {quote} Can we maybe rename Descriptor -> Type? Eg FieldDescriptor -> FieldType? {quote} Done. {quote} Can a single FieldDescriptor be shared among many fields? Seems like we'd have to take name out of FieldDescriptor (I don't think the name should be in FieldDescriptor, anyway). {quote} I agree, this should be possible. I removed the name. {quote} NumericFieldAttribute seems awkward (one shouldn't have to turn on/off zero padding, trie; or rather it's better to operate in use cases like I want to do range filtering or I want to sort). Seems like maybe we need a SortAttribute and RangeFilterAttribute (or... something). {quote} Yep, I agree. Some things in this prototype are quite goofy, because I wanted to mainly demonstrate the main ideas. The attributes you suggest make sense to me.
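A rough, hypothetical sketch of the descriptor/value split being discussed (the names are illustrative, not the patch's actual API): because the type object carries no field name, a single immutable instance can serve as the shared "blueprint" for many field values, so per-field settings are checked once rather than per document.

```java
// Shared, immutable field settings: the "class" in the Class-Object analogy.
// Deliberately has no field name, so one instance can be shared by many fields.
final class FieldType {
    final boolean stored;
    final boolean indexed;
    FieldType(boolean stored, boolean indexed) {
        this.stored = stored;
        this.indexed = indexed;
    }
}

// A concrete field value: the "object" in the analogy. Carries the name and
// the per-document value, plus a reference to the shared blueprint.
final class FieldValue {
    final String name;
    final FieldType type;
    final String value;
    FieldValue(String name, FieldType type, String value) {
        this.name = name;
        this.type = type;
        this.value = value;
    }
}
```

A consumer handed the shared FieldType up front can decide once which indexing chain applies, instead of re-inspecting settings for every field instance of every document.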
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703410#action_12703410 ] Michael McCandless commented on LUCENE-1593: bq. Use FMPP? I think the first question we need to answer is whether we cutover to specialization for this. At this point I don't think we need to, yet (I think the 12 classes is tolerable, since they are tiny and private). The second question is, if we do switch to specialization at some point (which I think we should: the performance gains are sizable), how should we do the generation (Python, Java, FMPP, XSLT, etc.). I think it's a long time before we need to make that decision (many iterations remain on LUCENE-1594).
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703412#action_12703412 ] Michael McCandless commented on LUCENE-1593: bq. the decision on which path to take can only be determined after score(Collector) is called, or next()/skipTo(). Oh I see: the Scorer cannot know on creation if it's the top scorer (score(Collector) will be called), or a secondary one (next()/skipTo(...) will be called). Hmm yeah maybe back to DISI.start(). I think as long the actual code that will next()/skipTo(...) through the iterator is the only one that calls start(), the BS/BS2 double-start problem won't happen? Really, somehow, it should be explicit when a Scorer will be topmost. IndexSearcher knows this when it creates it. Optimizations to TopScoreDocCollector and TopFieldCollector --- Key: LUCENE-1593 URL: https://issues.apache.org/jira/browse/LUCENE-1593 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 Attachments: LUCENE-1593.patch, PerfTest.java This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is: # Ensure that IndexSearcher returns segements in increasing doc Id order, instead of numDocs(). # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete. # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the check if reusableSD == null. # Also move to use changing top and then call adjustTop(), in case we update the queue. # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker for the last SortField. But, doing so should not be necessary (since we already break ties by docID), and is in fact less efficient (once the above optimization is in). # Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? 
Add an addDummyObjects method which will populate the queue without arranging it, just storing the objects in the array (this can be used to pre-populate sentinel values)? I will post a patch as well as some perf measurements as soon as I have them. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
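The sentinel-value idea in the plan above can be sketched as follows. This is a minimal illustration built on java.util.PriorityQueue, not Lucene's actual HitQueue (which mutates the top entry in place and re-heapifies via updateTop()/adjustTop() to avoid allocation): because the queue is pre-filled with -Infinity sentinels, the hot collect() loop never has to check whether the queue is full or whether a reusable entry is null.

```java
import java.util.PriorityQueue;

// Hypothetical sketch of sentinel pre-population (not Lucene's HitQueue).
public class SentinelHitQueue {
    static final class ScoreDoc {
        final int doc;
        final float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final PriorityQueue<ScoreDoc> pq;

    SentinelHitQueue(int size) {
        pq = new PriorityQueue<>(size, (a, b) -> Float.compare(a.score, b.score));
        // Pre-populate with sentinels so the queue is always "full".
        for (int i = 0; i < size; i++) {
            pq.add(new ScoreDoc(Integer.MAX_VALUE, Float.NEGATIVE_INFINITY));
        }
    }

    // The only check is against the current worst entry; since segments are
    // visited in increasing doc-id order, a tie never displaces an earlier doc.
    void collect(int doc, float score) {
        if (score > pq.peek().score) {
            pq.poll();
            pq.add(new ScoreDoc(doc, score));
        }
    }

    float worstScore() { return pq.peek().score; }
}
```

A real implementation would also avoid the per-hit allocation by reusing the evicted entry, which is exactly what the in-place top mutation plus adjustTop() buys.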
[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory
[ https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703415#action_12703415 ] Michael McCandless commented on LUCENE-1618: bq. Would also further suggest that this Directory implementation would take one or more directories to store documents, along with one or more directories to store the index itself You mean an opened IndexOutput would write its output to two (or more) different places? So you could write through a RAMDir down to an FSDir? (This way both the RAMDir and FSDir have a copy of the index). Allow setting the IndexWriter docstore to be a different directory -- Key: LUCENE-1618 URL: https://issues.apache.org/jira/browse/LUCENE-1618 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Original Estimate: 336h Remaining Estimate: 336h Add an IndexWriter.setDocStoreDirectory method that allows doc stores to be placed in a different directory than the IW default dir. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
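The "write its output to two (or more) different places" idea raised in the comment could look like the following tee stream: every byte is mirrored to all targets, so a RAM copy and a disk copy stay identical. This is plain java.io with a made-up class name, not Lucene's Directory/IndexOutput API.

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch: mirror all writes to several backing streams,
// e.g. an in-memory buffer plus a file stream.
public class TeeOutput extends OutputStream {
    private final OutputStream[] targets;

    public TeeOutput(OutputStream... targets) { this.targets = targets; }

    @Override public void write(int b) throws IOException {
        for (OutputStream t : targets) t.write(b); // mirror each byte
    }

    @Override public void flush() throws IOException {
        for (OutputStream t : targets) t.flush();
    }

    @Override public void close() throws IOException {
        for (OutputStream t : targets) t.close();
    }
}
```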
[jira] Commented: (LUCENE-1618) Allow setting the IndexWriter docstore to be a different directory
[ https://issues.apache.org/jira/browse/LUCENE-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703416#action_12703416 ] Michael McCandless commented on LUCENE-1618: {quote} FileSwitchDirectory should have the possibility to get the file list/extensions that should be loaded into RAM... making it maintenance free, pushing this decision to the end user... if and when we decide to support users in it, we could then maintain a static list at a separate place. Kind of separating execution and configuration. {quote} +1 With flexible indexing, presumably one could ask their codec for the doc store extensions vs the postings extensions, etc., and pass those to this configurable FileSwitchDirectory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
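A configurable extension-routing directory along the lines quoted above could look roughly like this. All names here (ExtensionSwitchDirectory, Backend) are illustrative stand-ins, not Lucene's FileSwitchDirectory API: the caller supplies the extensions that go to the primary backend (say, a RAM directory), and everything else falls through to the secondary.

```java
import java.util.Set;

// Hypothetical sketch of extension-based file routing between two backends.
public class ExtensionSwitchDirectory {
    interface Backend { void createFile(String name); }

    private final Set<String> primaryExtensions;
    private final Backend primary, secondary;

    ExtensionSwitchDirectory(Set<String> primaryExtensions,
                             Backend primary, Backend secondary) {
        this.primaryExtensions = primaryExtensions;
        this.primary = primary;
        this.secondary = secondary;
    }

    // Route by file extension; files with no extension go to the secondary.
    Backend backendFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1);
        return primaryExtensions.contains(ext) ? primary : secondary;
    }

    void createFile(String name) { backendFor(name).createFile(name); }
}
```

Under a codec-aware design, the primary extension set would come from asking the codec for its doc-store extensions rather than from a hard-coded list.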
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703419#action_12703419 ] Michael McCandless commented on LUCENE-1313: {quote} Sometimes we want a mixture of RAMDir segments and FSDir segments to merge (when we've decided we have too much in ram), {quote} I don't think we want to mix RAM/disk merging? E.g., when RAM is full, we want to quickly flush it to disk as a single segment. Merging with disk segments only makes that flush slower? {quote} I'm still a little confused as to why having a wrapper class that manages a disk writer and a ram writer isn't cleaner? {quote} This is functionally the same as not mixing RAM vs disk merging, right (i.e. just as clean)? Realtime Search --- Key: LUCENE-1313 URL: https://issues.apache.org/jira/browse/LUCENE-1313 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch Realtime search with transactional semantics. Possible future directions: * Optimistic concurrency * Replication Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot. I think this issue can hold realtime benchmarks which include indexing and searching concurrently. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
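The "wrapper class that manages a disk writer and a ram writer" under discussion could be sketched as below. Writer is a stand-in interface, not Lucene's IndexWriter: documents buffer in the RAM writer, and once the RAM budget is exceeded the whole RAM index is copied to disk in one sequential pass, never merged together with existing disk segments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a RAM-buffering indexer that flushes to disk
// as a single unit when a RAM budget is exceeded.
public class RamDiskIndexer {
    interface Writer {
        void addDocument(String doc);
        long ramBytesUsed();
        void copyInto(Writer target); // flush all buffered docs at once
        void clear();
    }

    // Toy in-memory Writer for demonstration; "RAM used" is just char count.
    static final class ListWriter implements Writer {
        final List<String> docs = new ArrayList<>();
        public void addDocument(String doc) { docs.add(doc); }
        public long ramBytesUsed() {
            long n = 0;
            for (String d : docs) n += d.length();
            return n;
        }
        public void copyInto(Writer target) {
            for (String d : docs) target.addDocument(d);
        }
        public void clear() { docs.clear(); }
    }

    private final Writer ramWriter, diskWriter;
    private final long ramBudgetBytes;

    RamDiskIndexer(Writer ramWriter, Writer diskWriter, long ramBudgetBytes) {
        this.ramWriter = ramWriter;
        this.diskWriter = diskWriter;
        this.ramBudgetBytes = ramBudgetBytes;
    }

    void add(String doc) {
        ramWriter.addDocument(doc);
        if (ramWriter.ramBytesUsed() > ramBudgetBytes) {
            // One fast sequential write; no merging with disk segments.
            ramWriter.copyInto(diskWriter);
            ramWriter.clear();
        }
    }
}
```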
RangeQuery and getTerm
RangeQuery is based on two terms rather than one, and currently returns null from getTerm. This can lead to less than obvious null pointer exceptions. I'd almost prefer to throw UnsupportedOperationException. However, returning null allows you to still use getTerm on MultiTermQuery and do a null check in the RangeQuery case. Not sure how valuable that really is though. Thoughts? -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
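The two contracts being weighed can be contrasted in a small sketch (illustrative names, not Lucene's actual MultiTermQuery/RangeQuery API). Returning null keeps one uniform getTerm() call site with a null check; throwing UnsupportedOperationException would force instanceof checks up front but fails loudly on misuse.

```java
// Hypothetical sketch: a multi-term query family where one subclass has
// no single representative term and returns null from getTerm().
abstract class MultiTermQuerySketch {
    abstract String getTerm(); // null when the query has no single term

    static final class TermRange extends MultiTermQuerySketch {
        final String lower, upper;
        TermRange(String lower, String upper) { this.lower = lower; this.upper = upper; }
        @Override String getTerm() { return null; } // two bounding terms, no single one
    }

    static final class SingleTerm extends MultiTermQuerySketch {
        final String term;
        SingleTerm(String term) { this.term = term; }
        @Override String getTerm() { return term; }
    }

    // Caller style that the null-returning contract enables.
    static String describe(MultiTermQuerySketch q) {
        String t = q.getTerm();
        return t == null ? "<multi-term>" : t;
    }
}
```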
Build failed in Hudson: Lucene-trunk #810
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/810/changes Changes: [mikemccand] LUCENE-1615: remove some more deprecated uses of Fieldable.omitTf [mikemccand] remove redundant CHANGES entries from trunk if they are already covered in 2.4.1 -- [...truncated 2887 lines...] compile-test: [echo] Building benchmark... javacc-uptodate-check: javacc-notice: jflex-uptodate-check: jflex-notice: common.init: compile-demo: javacc-uptodate-check: javacc-notice: jflex-uptodate-check: jflex-notice: init: clover.setup: clover.info: clover: common.compile-core: compile-core: compile-demo: compile-highlighter: [echo] Building highlighter... build-memory: javacc-uptodate-check: javacc-notice: jflex-uptodate-check: jflex-notice: common.init: build-lucene: build-lucene-tests: init: clover.setup: clover.info: clover: common.compile-core: compile-core: compile: check-files: init: clover.setup: clover.info: clover: compile-core: common.compile-test: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/test [javac] Compiling 9 source files to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/test [javac] Note: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/benchmark/src/test/org/apache/lucene/benchmark/quality/TestQualityRun.java uses or overrides a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. [copy] Copying 2 files to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/test build-artifacts-and-tests: [echo] Building collation... javacc-uptodate-check: javacc-notice: jflex-uptodate-check: jflex-notice: common.init: compile-misc: [echo] Building misc... 
javacc-uptodate-check: javacc-notice: jflex-uptodate-check: jflex-notice: common.init: build-lucene: build-lucene-tests: init: clover.setup: clover.info: clover: compile-core: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/misc/classes/java [javac] Compiling 16 source files to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/misc/classes/java [javac] Note: Some input files use or override a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. compile: init: clover.setup: clover.info: clover: compile-core: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/collation/classes/java [javac] Compiling 4 source files to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/collation/classes/java [javac] Note: Some input files use or override a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. jar-core: [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/collation/lucene-collation-2.4-SNAPSHOT.jar jar: compile-test: [echo] Building collation... javacc-uptodate-check: javacc-notice: jflex-uptodate-check: jflex-notice: common.init: compile-misc: [echo] Building misc... 
javacc-uptodate-check: javacc-notice: jflex-uptodate-check: jflex-notice: common.init: build-lucene: build-lucene-tests: init: clover.setup: clover.info: clover: compile-core: compile: init: clover.setup: clover.info: clover: compile-core: common.compile-test: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/collation/classes/test [javac] Compiling 5 source files to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/collation/classes/test [javac] Note: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/collation/src/test/org/apache/lucene/collation/CollationTestBase.java uses or overrides a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. build-artifacts-and-tests: bdb: [echo] Building bdb... javacc-uptodate-check: javacc-notice: jflex-uptodate-check: jflex-notice: common.init: build-lucene: build-lucene-tests: contrib-build.init: get-db-jar: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/db/bdb/lib [get] Getting: http://downloads.osafoundation.org/db/db-4.7.25.jar [get] To: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/db/bdb/lib/db-4.7.25.jar [get] Error getting http://downloads.osafoundation.org/db/db-4.7.25.jar to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/db/bdb/lib/db-4.7.25.jar BUILD FAILED http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build.xml :628: The following error occurred while
Re: [Lucene-java Wiki] Update of LuceneAtApacheConUs2009 by MichaelBusch
I'm happy to give more than one talk, on the other hand I don't want to prevent others from presenting. So if anyone likes to give similar talks to the ones I suggested, please let us know. -Michael On 4/27/09 10:07 PM, Apache Wiki wrote: Dear Wiki user, You have subscribed to a wiki page or wiki category on Lucene-java Wiki for change notification. The following page has been changed by MichaelBusch: http://wiki.apache.org/jakarta-lucene/LuceneAtApacheConUs2009 -- Let's wait to fill this in until Concom provides us a list from the regular CFP process. = Possible Talks or Tutorials = - * Lucene Basics (Michael Busch) + * Lucene Basics (Michael Busch or others?) * Intro to Solr (: Hoss out of the box talk?) * Intro to Nutch and/or Nutch Vertical Search (Andrzej Bialecki) (when was the last time we had a Nutch talk? ''probably never...'') * Mime Magic with Apache Tika (Jukka Zitting) @@ -34, +34 @@ - * New Features in Lucene (Michael Busch) + * New Features in Lucene (Michael Busch or others?) * Advanced Lucene Indexing (Michael Busch) * Building Intelligent Search Applications with the Lucene Ecosystem (Grant Ingersoll) - see abstract at bottom * Solr Operations and Performance Tuning - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1567) New flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703504#action_12703504 ] Bertrand Delacretaz commented on LUCENE-1567: - Grant, the ip-clearance document that you created under incubator-public in svn had not been added to the site-publish folder; I just did that in revision 769253. If that's not correct, please remove both xml and html versions of the lucene-query-parser file there. New flexible query parser - Key: LUCENE-1567 URL: https://issues.apache.org/jira/browse/LUCENE-1567 Project: Lucene - Java Issue Type: New Feature Components: QueryParser Environment: N/A Reporter: Luis Alves Assignee: Grant Ingersoll Attachments: lucene_trunk_FlexQueryParser_2009March24.patch, lucene_trunk_FlexQueryParser_2009March26_v3.patch From the New flexible query parser thread by Michael Busch: in my team at IBM we have used a different query parser than Lucene's in our products for quite a while. Recently we spent a significant amount of time in refactoring the code and designing a very generic architecture, so that this query parser can be easily used for different products with varying query syntaxes. This work was originally driven by Andreas Neumann (who, however, left our team); most of the code was written by Luis Alves, who has been a bit active in Lucene in the past, and Adriano Campos, who joined our team at IBM half a year ago. Adriano is an Apache committer and PMC member on the Tuscany project and getting familiar with Lucene now too. We think this code is much more flexible and extensible than the current Lucene query parser, and would therefore like to contribute it to Lucene. I'd like to give a very brief architecture overview here; Adriano and Luis can then answer more detailed questions as they're much more familiar with the code than I am. The goal was to separate syntax and semantics of a query. E.g. 
'a AND b', '+a +b', 'AND(a,b)' could be different syntaxes for the same query. We distinguish the semantics of the different query components, e.g. whether and how to tokenize/lemmatize/normalize the different terms or which Query objects to create for the terms. We wanted to be able to write a parser with a new syntax, while reusing the underlying semantics, as quickly as possible. In fact, Adriano is currently working on a 100% Lucene-syntax compatible implementation to make it easy for people who are using Lucene's query parser to switch. The query parser has three layers and its core is what we call the QueryNodeTree. It is a tree that initially represents the syntax of the original query, e.g. for 'a AND b':

  AND
 /   \
A     B

The three layers are:
1. QueryParser
2. QueryNodeProcessor
3. QueryBuilder

1. The upper layer is the parsing layer, which simply transforms the query text string into a QueryNodeTree. Currently our implementations of this layer use javacc.
2. The query node processors do most of the work. It is in fact a configurable chain of processors. Each processor can walk the tree and modify nodes or even the tree's structure. That makes it possible to e.g. do query optimization before the query is executed, or to tokenize terms.
3. The third layer is also a configurable chain of builders, which transform the QueryNodeTree into Lucene Query objects.

Furthermore, the query parser uses flexible configuration objects, which are based on AttributeSource/Attribute. It also uses message classes that allow attaching resource bundles. This makes it possible to translate messages, which is an important feature of a query parser. This design allows us to develop different query syntaxes very quickly. Adriano wrote the Lucene-compatible syntax in a matter of hours, and the underlying processors and builders in a few days. 
We now have a 100% compatible Lucene query parser, which means the syntax is identical and all query parser test cases pass on the new one too using a wrapper. Recent posts show that there is demand for query syntax improvements, e.g. improved range query syntax or operator precedence. There are already different QP implementations in Lucene+contrib; however, I think we did not keep them all up to date and in sync. This is not too surprising, because usually when fixes and changes are made to the main query parser, people don't make the corresponding changes in the contrib parsers. (I'm guilty here too.) With this new architecture it will be much easier to maintain different query syntaxes, as the actual code for the first layer is not very much. All syntaxes would benefit from patches and improvements we make to the underlying layers, which will make supporting different syntaxes much more manageable. -- This message is automatically generated by JIRA.
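Assuming the three-layer design described above, a toy end-to-end pipeline might look like this. All names are invented for illustration; the real contribution uses JavaCC-generated parsers for layer 1 and full processor/builder chains, not these hand-rolled stand-ins.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of the parse -> process -> build pipeline.
public class FlexParserSketch {
    // The QueryNodeTree: each node is an operator or a term.
    static final class Node {
        final String op;           // e.g. "AND" or "TERM"
        final String text;         // term text; null for operators
        final List<Node> children;
        Node(String op, String text, List<Node> children) {
            this.op = op; this.text = text; this.children = children;
        }
    }

    // Layer 1: syntax -> tree. Toy parser handling only "x AND y" or a bare term.
    static Node parse(String query) {
        String[] parts = query.split(" AND ");
        if (parts.length == 1) return new Node("TERM", parts[0].trim(), List.of());
        return new Node("AND", null,
            List.of(new Node("TERM", parts[0].trim(), List.of()),
                    new Node("TERM", parts[1].trim(), List.of())));
    }

    // Layer 2: a processor walks the tree and rewrites nodes,
    // here standing in for tokenization/normalization.
    static Node process(Node n, Function<String, String> termRewrite) {
        if (n.op.equals("TERM")) {
            return new Node("TERM", termRewrite.apply(n.text), List.of());
        }
        return new Node(n.op, null,
            n.children.stream().map(c -> process(c, termRewrite)).toList());
    }

    // Layer 3: a builder turns the tree into the final query,
    // rendered here as a string instead of a Lucene Query object.
    static String build(Node n) {
        if (n.op.equals("TERM")) return n.text;
        return n.op + "(" + String.join(",",
            n.children.stream().map(FlexParserSketch::build).toList()) + ")";
    }
}
```

Swapping the syntax means replacing only layer 1; the processor and builder chains are reused unchanged, which is the maintainability argument made above.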