[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744970#action_12744970 ] Michael McCandless commented on LUCENE-1821: BTW contrib/spatial has exactly this same problem. It currently builds up a cache, keyed on the top (MultiReader's) docID, of the precise distance computed by its precise distance filters, to then be used during sorting. Right now it simply computes its own docBase and increments it every time getDocIdSet() is called (which is messy). Though I think it could (and should) switch to a per-segment cache. I am torn. On the one hand we don't want to encourage apps to be using top docIDs anywhere down low (eg Weight/Scorer). We'd like all such per-segment switching to happen up high. But on the other hand, this is quite a sudden change, and most advanced apps will be using the top docIDs by definition (since per-segment docIDs only become an [easy] option in 2.9), so it'd be more friendly to offer a cleaner migration path for such apps where Weight/Scorer is told its docBase. And having to migrate an ord index from top to sub docIDs is truly a nightmare, having gone through that with Mark in getting String sorting to work per segment!
Weight.scorer() not passed doc offset for sub reader -- Key: LUCENE-1821 URL: https://issues.apache.org/jira/browse/LUCENE-1821 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.9 Reporter: Tim Smith Now that searching is done on a per-segment basis, there is no way for a Scorer to know the actual doc id for the documents it matches (only the relative doc offset into the segment). If using caches in your scorer that are based on the entire index (all segments), there is now no way to index into them properly from inside a Scorer, because the scorer is not passed the offset needed to calculate the real docid. Suggest having the Weight.scorer() method also take an integer for the doc offset. The abstract Weight class should have a constructor that takes this offset, as well as a method to get the offset. All Weights that have sub-weights must pass this offset down to created sub-weights. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
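The offset the reporter is asking for is the segment's starting doc id within the top-level reader (a "docBase"). The arithmetic involved can be sketched in plain Java with no Lucene dependency; all names below are illustrative stand-ins, not Lucene API:

```java
// Sketch: rebasing per-segment docids to top-level docids via a docBase
// prefix sum. Plain Java, no Lucene dependency; names are illustrative.
public class DocBaseSketch {
    // docBase[i] = number of docs in all segments before segment i
    static int[] computeDocBases(int[] segmentMaxDocs) {
        int[] bases = new int[segmentMaxDocs.length];
        int sum = 0;
        for (int i = 0; i < segmentMaxDocs.length; i++) {
            bases[i] = sum;
            sum += segmentMaxDocs[i];
        }
        return bases;
    }

    // A Scorer running against segment 'seg' would add its base to each
    // segment-local hit to recover the top-level docid
    static int toTopLevel(int[] bases, int seg, int segmentLocalDoc) {
        return bases[seg] + segmentLocalDoc;
    }

    public static void main(String[] args) {
        int[] maxDocs = {100, 250, 50};         // three segments
        int[] bases = computeDocBases(maxDocs); // {0, 100, 350}
        System.out.println(toTopLevel(bases, 0, 7));  // 7
        System.out.println(toTopLevel(bases, 1, 7));  // 107
        System.out.println(toTopLevel(bases, 2, 49)); // 399
    }
}
```

This is exactly the increment-per-getDocIdSet() bookkeeping the contrib/spatial comment above describes, just precomputed once.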
[jira] Resolved: (LUCENE-1794) implement reusableTokenStream for all contrib analyzers
[ https://issues.apache.org/jira/browse/LUCENE-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-1794. - Resolution: Fixed Committed revision 805766. implement reusableTokenStream for all contrib analyzers --- Key: LUCENE-1794 URL: https://issues.apache.org/jira/browse/LUCENE-1794 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Reporter: Robert Muir Assignee: Robert Muir Priority: Minor Fix For: 2.9 Attachments: LUCENE-1794-reusing-analyzer.patch, LUCENE-1794.patch, LUCENE-1794.patch, LUCENE-1794.patch, LUCENE-1794.patch, LUCENE-1794.patch, LUCENE-1794.patch, LUCENE-1794_fix.patch, LUCENE-1794_fix2.txt most contrib analyzers do not have an impl for reusableTokenStream regardless of how expensive the back compat reflection is for indexing speed, I think we should do this to mitigate any performance costs. hey, overall it might even be an improvement! the back compat code for non-final analyzers is already in place so this is easy money in my opinion. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
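The general idea behind reusableTokenStream is to cache one token-stream instance per thread and re-initialize it per document instead of allocating a fresh one each time. A minimal self-contained sketch of that reuse pattern (stand-in classes, not Lucene's actual analyzer code):

```java
// Sketch of the reuse pattern behind reusableTokenStream: cache one
// stream per thread and reset it with new input rather than allocating.
// FakeTokenStream is a stand-in; this is not Lucene's API.
import java.io.Reader;
import java.io.StringReader;

public class ReusableSketch {
    static class FakeTokenStream {
        Reader input;
        int resets = 0;
        void reset(Reader r) { this.input = r; resets++; }
    }

    // one cached stream per analyzer instance per thread
    private final ThreadLocal<FakeTokenStream> cached = new ThreadLocal<FakeTokenStream>();

    FakeTokenStream reusableTokenStream(Reader reader) {
        FakeTokenStream ts = cached.get();
        if (ts == null) {
            ts = new FakeTokenStream(); // first use on this thread: allocate
            cached.set(ts);
        }
        ts.reset(reader);               // later uses: just re-initialize
        return ts;
    }

    public static void main(String[] args) {
        ReusableSketch a = new ReusableSketch();
        FakeTokenStream t1 = a.reusableTokenStream(new StringReader("doc one"));
        FakeTokenStream t2 = a.reusableTokenStream(new StringReader("doc two"));
        System.out.println(t1 == t2);  // true: same instance reused
        System.out.println(t2.resets); // 2
    }
}
```

The win during indexing is avoiding per-document allocation of the whole analysis chain, which is why the issue calls it "easy money".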
[jira] Resolved: (LUCENE-1813) Add option to ReverseStringFilter to mark reversed tokens
[ https://issues.apache.org/jira/browse/LUCENE-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-1813. - Resolution: Fixed Committed revision 805769. Thanks Andrzej and also everyone who provided feedback Add option to ReverseStringFilter to mark reversed tokens - Key: LUCENE-1813 URL: https://issues.apache.org/jira/browse/LUCENE-1813 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Affects Versions: 2.9 Reporter: Andrzej Bialecki Assignee: Robert Muir Fix For: 2.9 Attachments: LUCENE-1813.patch, LUCENE-1813.patch, LUCENE-1813.patch, reverseMark-2.patch, reverseMark.patch This patch implements additional functionality in the filter to mark reversed tokens with a special marker character (Unicode 0001). This is useful when indexing both straight and reversed tokens (e.g. to implement efficient leading wildcards search). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
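The core string operation in the feature above is small: reverse the token and prepend the U+0001 marker, so that reversed tokens can be told apart from ordinary ones when both forms are indexed. A sketch of just that operation (not the actual filter code):

```java
// Sketch of the LUCENE-1813 marking scheme: reversed tokens get a
// U+0001 marker prefix so they cannot collide with ordinary tokens.
// This is the string operation only, not the TokenFilter itself.
public class ReverseMarkSketch {
    static final char MARKER = '\u0001'; // marker character from the issue

    // reverse the token and prepend the marker character
    static String reverseAndMark(String token) {
        return MARKER + new StringBuilder(token).reverse().toString();
    }

    public static void main(String[] args) {
        // A leading-wildcard query like *bar can then be rewritten into an
        // efficient prefix query on the reversed+marked form: \u0001rab*
        String marked = reverseAndMark("foobar"); // "\u0001" + "raboof"
        System.out.println(marked.startsWith(MARKER + "rab")); // true
    }
}
```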
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745036#action_12745036 ] Tim Smith commented on LUCENE-1821: --- Concerning the changelog, I feel the below should be added to the Changes in runtime behavior section (it's kinda specified in New features, however it is also a rather substantial change in the runtime behavior and should be called out explicitly there) {code} 13. LUCENE-1483: When searching over multiple segments, a new Scorer is created for each segment. The Weight is created only once for the top level searcher. Each Scorer is passed the per-segment IndexReader. This will result in docids in the Scorer being internal to the per-segment IndexReader and there is currently no way to rebase these docids to the top level IndexReader. This results in any caches/filters that use docids over the top IndexReader to be broken. {code} Weight.scorer() not passed doc offset for sub reader -- Key: LUCENE-1821 URL: https://issues.apache.org/jira/browse/LUCENE-1821
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745039#action_12745039 ] Mark Miller commented on LUCENE-1821: - I think that's a good idea. I think that last sentence needs a bit of work. Here is another attempt that I am still not quite happy with: {code}13. LUCENE-1483: When searching over multiple segments, a new Scorer is created for each segment. The Weight is created only once for the top level searcher. Each Scorer is passed the per-segment IndexReader. This will result in docids in the Scorer being internal to the per-segment IndexReader and there is currently no way to rebase these docids to the top level IndexReader. This will likely break any caches/filters in Scorers that rely on docids from the top level IndexReader eg if you rely on the IndexReader to contain every doc id in the index.{code} Weight.scorer() not passed doc offset for sub reader -- Key: LUCENE-1821 URL: https://issues.apache.org/jira/browse/LUCENE-1821
[jira] Updated: (LUCENE-1824) FastVectorHighlighter truncates words at beginning and end of fragments
[ https://issues.apache.org/jira/browse/LUCENE-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Vigdor updated LUCENE-1824: Attachment: LUCENE-1824-test.patch FastVectorHighlighter truncates words at beginning and end of fragments --- Key: LUCENE-1824 URL: https://issues.apache.org/jira/browse/LUCENE-1824 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Environment: any Reporter: Alex Vigdor Priority: Minor Fix For: 3.1 Attachments: LUCENE-1824-test.patch, LUCENE-1824.patch FastVectorHighlighter does not take word boundaries into consideration when building fragments, so that in most cases the first and last word of a fragment are truncated. This makes the highlights less legible than they should be. I will attach a patch to BaseFragmentBuilder that resolves this by expanding the start and end boundaries of the fragment to the first whitespace character on either side of the fragment, or the beginning or end of the source text, whichever comes first. This significantly improves legibility, at the cost of returning a slightly larger number of characters than specified for the fragment size. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1824) FastVectorHighlighter truncates words at beginning and end of fragments
[ https://issues.apache.org/jira/browse/LUCENE-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Vigdor updated LUCENE-1824: Attachment: (was: LUCENE-1824-test.patch) FastVectorHighlighter truncates words at beginning and end of fragments --- Key: LUCENE-1824 URL: https://issues.apache.org/jira/browse/LUCENE-1824
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745041#action_12745041 ] Tim Smith commented on LUCENE-1821: --- One more pass {code} 13. LUCENE-1483: When searching over multiple segments, a new Scorer is created for each segment. The Weight is created only once for the top level searcher. Each Scorer is passed the per-segment IndexReader. This will result in docids in the Scorer being internal to the per-segment IndexReader. If a custom Scorer implementation uses any caches/filters based on the top level IndexReader/Searcher, it will need to be updated to use caches/filters on a per segment basis. There is currently no way provided to rebase the docids in the Scorer to the top level IndexReader. See LUCENE-1821 for discussion on workarounds for this. {code} Weight.scorer() not passed doc offset for sub reader -- Key: LUCENE-1821 URL: https://issues.apache.org/jira/browse/LUCENE-1821
[jira] Updated: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-1821: -- Attachment: LUCENE-1821.patch Here's a patch that adds getIndexReaderBase(IndexReader reader) to IndexSearcher. Sadly, this cannot easily be added to MultiSearcher as well, since it uses Searchables, which would require adding this method to the Searchable interface. I could work up another patch that adds this method to the Searchable interface, however that has some back-compat concerns. Weight.scorer() not passed doc offset for sub reader -- Key: LUCENE-1821 URL: https://issues.apache.org/jira/browse/LUCENE-1821 Attachments: LUCENE-1821.patch
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745044#action_12745044 ] Mark Miller commented on LUCENE-1821: - Looks great! I still almost want to say rely on though: bq. uses any caches/filters based on the top level IndexReader/Searcher bq. uses any caches/filters that rely on being based on the top level IndexReader/Searcher No? It seems like you could be based on a top level reader before, but not rely on the fact that it was a top level ... Weight.scorer() not passed doc offset for sub reader -- Key: LUCENE-1821 URL: https://issues.apache.org/jira/browse/LUCENE-1821
[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745045#action_12745045 ] Tim Smith commented on LUCENE-1821: --- rely on it is Weight.scorer() not passed doc offset for sub reader -- Key: LUCENE-1821 URL: https://issues.apache.org/jira/browse/LUCENE-1821
[jira] Updated: (LUCENE-1824) FastVectorHighlighter truncates words at beginning and end of fragments
[ https://issues.apache.org/jira/browse/LUCENE-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Vigdor updated LUCENE-1824: Attachment: (was: LUCENE-1824.patch) FastVectorHighlighter truncates words at beginning and end of fragments --- Key: LUCENE-1824 URL: https://issues.apache.org/jira/browse/LUCENE-1824
[jira] Updated: (LUCENE-1824) FastVectorHighlighter truncates words at beginning and end of fragments
[ https://issues.apache.org/jira/browse/LUCENE-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Vigdor updated LUCENE-1824: Attachment: (was: LUCENE-1824-test.patch) FastVectorHighlighter truncates words at beginning and end of fragments --- Key: LUCENE-1824 URL: https://issues.apache.org/jira/browse/LUCENE-1824
[jira] Commented: (LUCENE-1824) FastVectorHighlighter truncates words at beginning and end of fragments
[ https://issues.apache.org/jira/browse/LUCENE-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745048#action_12745048 ] Alex Vigdor commented on LUCENE-1824: - The failing test was due to an extra whitespace character at the beginning of the output, which I think is insignificant. However, I appreciate that the whitespace approach will not work for CJK, so I have moved my modifications to a new WhitespaceFragmentBuilder class and associated test class. The updated patch now contains just these two new classes and no modifications to other code. I don't want to hold up the release of 2.9, but anyone attempting to use the SimpleFragmentsBuilder with Latin languages, or others that use whitespace to delimit words, will be dismayed by the rampant truncation! FastVectorHighlighter truncates words at beginning and end of fragments --- Key: LUCENE-1824 URL: https://issues.apache.org/jira/browse/LUCENE-1824
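The boundary-expansion idea described in this issue is simple to state in code: widen the fragment window outward until each side hits whitespace or the edge of the text. A self-contained sketch of that approach (illustrative only, not the actual patch or the WhitespaceFragmentBuilder source):

```java
// Sketch of the LUCENE-1824 idea: expand [start, end) outward to the
// nearest whitespace (or text edge) so fragments never begin or end
// mid-word. Illustrative only, not the actual patch code.
public class FragmentBoundsSketch {
    static int[] expandToWhitespace(String text, int start, int end) {
        while (start > 0 && !Character.isWhitespace(text.charAt(start - 1))) {
            start--; // walk left until the previous char is whitespace
        }
        while (end < text.length() && !Character.isWhitespace(text.charAt(end))) {
            end++;   // walk right until the current char is whitespace
        }
        return new int[]{start, end};
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumps";
        // the raw window cuts mid-word: "ick brown fo"
        int[] b = expandToWhitespace(text, 6, 17);
        System.out.println(text.substring(b[0], b[1])); // "quick brown fox"
    }
}
```

As the issue notes, the expanded fragment can exceed the requested fragment size slightly, and this heuristic only helps languages that delimit words with whitespace, hence the move to a separate builder class rather than changing the default.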
[jira] Updated: (LUCENE-1821) Weight.scorer() not passed doc offset for sub reader
[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Smith updated LUCENE-1821: -- Description: Now that searching is done on a per-segment basis, there is no way for a Scorer to know the actual doc id for the documents it matches (only the relative doc offset into the segment). If using caches in your scorer that are based on the entire index (all segments), there is now no way to index into them properly from inside a Scorer, because the scorer is not passed the offset needed to calculate the real docid. Suggest having the Weight.scorer() method also take an integer for the doc offset. The abstract Weight class should have a constructor that takes this offset, as well as a method to get the offset. All Weights that have sub-weights must pass this offset down to created sub-weights.
Details on workaround: In order to work around this, you must do the following:
* Subclass IndexSearcher
* Add an int getIndexReaderBase(IndexReader) method to your subclass
* During Weight creation, the Weight must hold onto a reference to the passed-in Searcher (cast to your subclass)
* During Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
* The Scorer can now rebase any collected docids using this offset
Example implementation of getIndexReaderBase():
{code}
// NOTE: a more efficient implementation can be done if you cache the result
// of gatherSubReaders in your constructor
public int getIndexReaderBase(IndexReader reader) {
  if (reader == getReader()) {
    return 0;
  } else {
    List readers = new ArrayList();
    gatherSubReaders(readers);
    Iterator iter = readers.iterator();
    int maxDoc = 0;
    while (iter.hasNext()) {
      IndexReader r = (IndexReader) iter.next();
      if (r == reader) {
        return maxDoc;
      }
      maxDoc += r.maxDoc();
    }
  }
  return -1; // reader not in searcher
}
{code}
Notes:
* This workaround makes it so you cannot serialize your custom Weight implementation
was: (the previous description, without the workaround details)
Weight.scorer() not passed doc offset for sub reader -- Key: LUCENE-1821 URL: https://issues.apache.org/jira/browse/LUCENE-1821 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.9 Reporter: Tim Smith Attachments: LUCENE-1821.patch
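The NOTE in the workaround suggests caching the result of gatherSubReaders in the constructor. One way to sketch that caching is an identity map from sub-reader to its precomputed base, built once, which turns the linear scan into a constant-time lookup. The classes below are plain-Java stand-ins, not Lucene API:

```java
// Sketch of the caching the workaround's NOTE hints at: precompute each
// sub-reader's docid base once, then look it up by reader identity.
// FakeReader is a stand-in for IndexReader; not Lucene API.
import java.util.IdentityHashMap;
import java.util.Map;

public class CachedBaseSketch {
    static class FakeReader {
        final int maxDoc;
        FakeReader(int maxDoc) { this.maxDoc = maxDoc; }
        int maxDoc() { return maxDoc; }
    }

    // built once (e.g. in the searcher's constructor) from the sub-readers
    private final Map<FakeReader, Integer> bases =
            new IdentityHashMap<FakeReader, Integer>();

    CachedBaseSketch(FakeReader[] subReaders) {
        int base = 0;
        for (FakeReader r : subReaders) {
            bases.put(r, base);   // identity map: same semantics as '=='
            base += r.maxDoc();
        }
    }

    int getIndexReaderBase(FakeReader reader) {
        Integer base = bases.get(reader);
        return base == null ? -1 : base; // -1: reader not in searcher
    }

    public static void main(String[] args) {
        FakeReader a = new FakeReader(100), b = new FakeReader(250), c = new FakeReader(50);
        CachedBaseSketch s = new CachedBaseSketch(new FakeReader[]{a, b, c});
        System.out.println(s.getIndexReaderBase(b));                // 100
        System.out.println(s.getIndexReaderBase(c));                // 350
        System.out.println(s.getIndexReaderBase(new FakeReader(1))); // -1
    }
}
```

IdentityHashMap is used deliberately: the original workaround compares readers with ==, and this preserves that reference-equality semantics. This assumes the segment set is fixed for the life of the searcher, which holds for a searcher over a point-in-time reader.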
Re: Finishing Lucene 2.9
0 issues! Congrats everyone. 2.9 was quite a beast. So it looks like we should get a few things in order.
1. Anyone dying to be release manager? I think I could do it, but I'm kind of pressed for time ...
2. Let's start crawling all over this release - bugs/javadoc/packaging etc.
3. In regards to that - I'd like to suggest that we don't do the release branch early for 2.9. I know we normally make the release branch so that further dev can continue on trunk. In this case I don't think that is wise. I propose that we lock down trunk for a while, to force people to concentrate on *this* release. Otherwise we divide our limited forces into two - those working on the release, and those working on trunk and beyond. We can kind of enforce this by making the release branch last minute I think.
4. I suggest we offer an early release candidate type build (very soon) - nothing official, nothing signed - just something easier for our user community to test with if they are not very familiar with building a release off of trunk.
-- - Mark http://www.lucidimagination.com
Re: Finishing Lucene 2.9
On Wed, Aug 19, 2009 at 10:49 AM, Mark Miller markrmil...@gmail.com wrote: 3. In regards to that - I'd like to suggest that we don't do the release branch early for 2.9. I know we normally make the release branch so that further dev can continue on trunk. In this case I don't think that is wise. I propose that we lock down trunk for a while, to force people to concentrate on *this* release. Otherwise we divide our limited forces into two - those working on release, and those working on trunk and beyond. We can kind of enforce this by making the release branch last minute I think. +1 I've experienced the extra pain of having to merge every change from branch up until the release (esp when the CHANGES.txt is different and patch fails) - there's really no point - checkins for the next release can normally wait. 4. I suggest we offer an early release candidate type build (very soon) - nothing official, nothing signed - just something easier for our user community to test with if they are not very familiar with building a release off of trunk. +1 I've also observed people bringing up release nits only *after* an official vote for a package has started - that messes up stuff like trying to post-date in CHANGES. Developers should do ant package *now* and bring up issues and objections while it's easy to fix - get everything possible out of the way before the official VOTE thread. A final note - AFAIK, the ReleaseTodo http://wiki.apache.org/jakarta-lucene/ReleaseTodo is for the purpose of helping people do releases - it's not an official release process where every step must be followed... these are only guidelines. There's also no reason why the release manager needs to be the one to do all the items like run RAT, etc. That can be done by anyone interested - including other contributors who do not yet have commit privileges.
-Yonik http://www.lucidimagination.com
Re: Finishing Lucene 2.9
On Wed, Aug 19, 2009 at 1:52 PM, Grant Ingersoll gsing...@apache.org wrote:

> the RM should follow the release procedure as specified.

Wiki documents are normally not official - anyone can modify them, and people have been, with little/no discussion. I'll admit that I can't always follow java-dev, so I may have missed a vote to codify/upgrade this release guideline into an official process that must be followed. At least I know that's not the case in Solr-land though, and I've updated the wiki to reflect that.

-Yonik http://www.lucidimagination.com
Re: Finishing Lucene 2.9
On Aug 19, 2009, at 2:13 PM, Yonik Seeley wrote:

> On Wed, Aug 19, 2009 at 1:52 PM, Grant Ingersoll gsing...@apache.org wrote:
>> the RM should follow the release procedure as specified.
> Wiki documents are normally not official - anyone can modify them, and people have been, with little/no discussion. I'll admit that I can't always follow java-dev, so I may have missed a vote to codify/upgrade this release guideline as an official process that must be followed. At least I know that's not the case in Solr-land though, and I've updated the wiki to reflect that.

I find it scary to think that one release might contain Maven artifacts, for instance, while another, done by a different person, might not, simply b/c the RM doesn't feel like it. I don't agree here, and I don't agree for Solr. Stable RM is as important as backward compatibility, if not more so.
Re: Finishing Lucene 2.9
Okay, I can do the test/beta release dist and host it on people.apache.org. Anyone have any pref on what we call this? It's not really a release candidate per se, though I have no problem calling it that. We can go from rc1 to rc20 for all it matters.

-- - Mark http://www.lucidimagination.com
Re: Finishing Lucene 2.9
So, are we under a code freeze now? And only doing doc/breakers?

-Grant

On Aug 19, 2009, at 3:08 PM, Mark Miller wrote:
> Okay, I can do the test/beta release dist and host it on people.apache.org. Anyone have any pref on what we call this? It's not really a release candidate per se, though I have no problem calling it that. We can go from rc1 to rc20 for all it matters.
> -- - Mark http://www.lucidimagination.com
Re: Finishing Lucene 2.9
Not sure - though if not now, then extremely imminently. I have no problem giving a bit of time for people to weigh in on that. I'm trying to get a feel for what the community wants to do before actually putting anything up or sending anything out to java-user. I'm prepped to go when it makes sense.

- Mark

Grant Ingersoll wrote:
> So, are we under a code freeze now? And only doing doc/breakers? -Grant

-- - Mark http://www.lucidimagination.com
Re: Finishing Lucene 2.9
On 8/19/09 11:43 AM, Grant Ingersoll wrote:
> I find it scary to think that one release might contain Maven artifacts, for instance, while another, done by a different person, might not, simply b/c the RM doesn't feel like it. I don't agree here, and I don't agree for Solr. Stable RM is as important as backward compatibility, if not more so.

+1. I too think that the RM should follow the guidelines.

Michael
Re: Finishing Lucene 2.9
When I was the RM I usually sent out a note in advance with a tentative schedule, i.e. code freeze date, length of the code freeze period, and release date (again, all tentative of course). Then the community could give feedback on the proposed schedule and plan accordingly.

Michael

On 8/19/09 1:19 PM, Mark Miller wrote:
> Not sure - though if not now, then extremely imminently. I have no problem giving a bit of time for people to weigh in on that. I'm trying to get a feel for what the community wants to do before actually putting anything up or sending anything out to java-user. I'm prepped to go when it makes sense. - Mark
Re: Finishing Lucene 2.9
I hadn't settled on me being the RM yet ;) Though if no one else steps up, I will be.

I was suggesting a kind of earlier, looser test jar than what we have previously done as an RC - essentially a nightly of trunk (which are hard to find lately IME - the last one I got I had to dig through Hudson for) - just for users that haven't built from svn, and wouldn't normally go through the hassle. The more users testing, and the earlier, the better. And that is what I was volunteering to do.

However, looking at the Release TODOs, this still really fits the mold anyway. No need to do anything special I guess - just get to the RC step quickly, knowing that other RCs are likely to follow.

- Mark

Michael Busch wrote:
> When I was the RM I usually sent out a note in advance with a tentative schedule, i.e. code freeze date, length of the code freeze period, and release date (again, all tentative of course). Then the community could give feedback on the proposed schedule and plan accordingly.

-- - Mark http://www.lucidimagination.com
RE: Finishing Lucene 2.9
> 0 issues! Congrats everyone. 2.9 was quite a beast. So it looks like we should get a few things in order.
> 1. Anyone dying to be release manager? I think I could do it, but I'm kind of pressed for time ...
> 2. Let's start crawling all over this release - bugs/javadoc/packaging etc.
> 3. In regards to that - I'd like to suggest that we don't do the release branch early for 2.9. I know we normally make the release branch so that further dev can continue on trunk. In this case I don't think that is wise. I propose that we lock down trunk for a while, to force people to concentrate on *this* release.

I think 3.0 is a little bit special: We move to Java 1.5, so in my opinion we should not only remove deprecations, but also add generics, remove StringBuffer, and so on. I have some patches for that available; e.g. the casting currently needed for the Attributes API can be solved more elegantly by using generics (something like <T extends Attribute> T addAttribute(Class<T>)). If we do not add generics to the public API in 3.0, we have to wait one more major release to add them. To get the 3.0 release shortly after 2.9, we should branch now, so that the generics commits can be done early. I would also help to do this (at least for the parts I was working on last time).

> 4. I suggest we offer an early release candidate type build (very soon) - nothing official, nothing signed - just something easier for our user community to test with if they are not very familiar with building a release off of trunk.

+1 Start the release process!

Uwe
Re: Finishing Lucene 2.9
Uwe Schindler wrote:
> I think 3.0 is a little bit special: We move to Java 1.5, so in my opinion we should not only remove deprecations, but also add generics, remove StringBuffer, and so on. I have some patches for that available; e.g. the casting currently needed for the Attributes API can be solved more elegantly by using generics (something like <T extends Attribute> T addAttribute(Class<T>)). If we do not add generics to the public API in 3.0, we have to wait one more major release to add them. To get the 3.0 release shortly after 2.9, we should branch now, so that the generics commits can be done early.

I forgot about this oddity. It's so weird. It's like we are doing two releases on top of each other - it just seems confusing. Apache Lucene announces 2.9 - a lot of hard work and sweat - move to it. And five minutes later: Apache Lucene announces 3.0 - very little work, but different and improved (generified anyway). No new features in 3.0. Hold the applause. Now move to it.

I vote to make this more sane :)

-- - Mark http://www.lucidimagination.com
[jira] Commented: (LUCENE-1768) NumericRange support for new query parser
[ https://issues.apache.org/jira/browse/LUCENE-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745236#action_12745236 ] Adriano Crestani commented on LUCENE-1768:

{quote} we should rename RangeQueryNode to TermRangeQueryNode (to match the Lucene name) ... I would not do this. RangeQueryNode is in the syntax tree, and the syntax of numeric and term ranges is identical, so the query parser cannot know what type of query it is. When this issue is fixed in 3.1, this node will use the configuration of data types for field names (date, numeric, term) to create the correct range query. {quote}

I think it's OK to rename. As far as I know, the standard.parser.SyntaxParser generates a ParametricRangeQueryNode from a range query, which has two ParametricQueryNode children. So the range processor will need to convert the two ParametricQueryNode objects to the respective type, based on the user config: TermRangeQueryNode (renamed from RangeQueryNode) or NumericRangeQueryNode.

NumericRange support for new query parser - Key: LUCENE-1768 URL: https://issues.apache.org/jira/browse/LUCENE-1768 Project: Lucene - Java Issue Type: New Feature Components: QueryParser Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.1

It would be good to specify some type of schema for the query parser in the future, to automatically create NumericRangeQuery for different numeric types. It would then be possible to index a numeric value (double, float, long, int) using NumericField, and the query parser would know which type of field this is and correctly create a NumericRangeQuery for strings like [1.567..*] or (1.787..19.5]. There is currently no way to tell from the index whether a field is numeric, so the user will have to configure the FieldConfig objects in the ConfigHandler. But once this is done, it will not be that difficult to implement the rest.
The only difference from the current handling of RangeQuery is then the instantiation of the correct Query type and the conversion of the entered numeric values (a simple Number.valueOf(...)-style conversion of the user-entered numbers). Everything else is identical; NumericRangeQuery also supports the MTQ rewrite modes (as it is a MultiTermQuery). Another thing is a change in date semantics: there are some strange flags in the current parser that tell it how to handle dates.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
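The conversion step described above - mapping a user-entered range bound to a typed Number based on a per-field configuration - can be sketched in isolation. The class and method names below are invented for illustration (this is not the actual contrib queryparser FieldConfig code), but the Number.valueOf(...)-style dispatch is the idea being discussed:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the per-field numeric conversion: given a
// user-entered range bound and the configured type of the field,
// produce the typed Number a NumericRangeQuery-style factory would need.
public class NumericBoundConverter {
    public enum FieldType { INT, LONG, FLOAT, DOUBLE }

    private final Map<String, FieldType> fieldConfig =
        new HashMap<String, FieldType>();

    // Stand-in for the FieldConfig/ConfigHandler setup mentioned in the issue.
    public void configure(String field, FieldType type) {
        fieldConfig.put(field, type);
    }

    // Simple Number.valueOf(...)-style conversion of the entered text.
    public Number convert(String field, String bound) {
        FieldType type = fieldConfig.get(field);
        if (type == null) {
            throw new IllegalArgumentException("field not configured as numeric: " + field);
        }
        switch (type) {
            case INT:    return Integer.valueOf(bound);
            case LONG:   return Long.valueOf(bound);
            case FLOAT:  return Float.valueOf(bound);
            default:     return Double.valueOf(bound);
        }
    }

    public static void main(String[] args) {
        NumericBoundConverter c = new NumericBoundConverter();
        c.configure("price", FieldType.DOUBLE);
        System.out.println(c.convert("price", "1.567")); // prints 1.567
    }
}
```

As the comment thread notes, everything beyond this conversion (and picking the right Query class) is identical to the existing term-range handling.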
Re: Finishing Lucene 2.9
On 8/19/09 3:16 PM, Uwe Schindler wrote:
> I think 3.0 is a little bit special: We move to Java 1.5, so in my opinion we should not only remove deprecations, but also add generics, remove StringBuffer, and so on. I have some patches for that available; e.g. the casting currently needed for the Attributes API can be solved more elegantly by using generics (something like <T extends Attribute> T addAttribute(Class<T>)). If we do not add generics to the public API in 3.0, we have to wait one more major release to add them.

Yes, I added that already in the very first AttributeSource patch - it's currently commented out at the bottom of the class, I think. Probably a bit out of date. I definitely want to do that to improve readability of the attributes; it's much nicer with generics. That's how I started coding it, and why I started liking the syntax, before I needed to make it a bit ugly for JDK 1.4.

Michael
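The generics pattern Uwe and Michael are discussing can be shown with a standalone sketch. The class names below are simplified stand-ins, not the real Lucene AttributeSource implementation; the point is that a Class<T>-parameterized addAttribute returns T directly, so the JDK 1.4-era cast at the call site disappears:

```java
import java.util.HashMap;
import java.util.Map;

public class AttributeDemo {
    // Marker interface, analogous to Lucene's Attribute.
    public interface Attribute {}

    public static class TermAttribute implements Attribute {
        public String term;
    }

    // Simplified stand-in for Lucene's AttributeSource.
    public static class AttributeSource {
        private final Map<Class<? extends Attribute>, Attribute> attributes =
            new HashMap<Class<? extends Attribute>, Attribute>();

        // With generics the return type tracks the requested class, replacing
        // the 1.4-style "(TermAttribute) addAttribute(TermAttribute.class)".
        public <T extends Attribute> T addAttribute(Class<T> attClass) {
            Attribute att = attributes.get(attClass);
            if (att == null) {
                try {
                    att = attClass.getDeclaredConstructor().newInstance();
                } catch (Exception e) {
                    throw new IllegalArgumentException(
                        "cannot instantiate " + attClass.getName(), e);
                }
                attributes.put(attClass, att);
            }
            return attClass.cast(att); // type-safe, no unchecked warning
        }
    }

    public static void main(String[] args) {
        AttributeSource src = new AttributeSource();
        TermAttribute term = src.addAttribute(TermAttribute.class); // no cast needed
        // Repeated lookups return the cached instance.
        System.out.println(src.addAttribute(TermAttribute.class) == term); // prints true
    }
}
```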
[jira] Commented: (LUCENE-1823) QueryParser with new features for Lucene 3
[ https://issues.apache.org/jira/browse/LUCENE-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745251#action_12745251 ] Michael Busch commented on LUCENE-1823:

I think Solr has a feature similar to what I called 'Opaque terms': Nested Queries.

QueryParser with new features for Lucene 3 -- Key: LUCENE-1823 URL: https://issues.apache.org/jira/browse/LUCENE-1823 Project: Lucene - Java Issue Type: New Feature Components: QueryParser Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 3.1

I'd like to have a new QueryParser implementation in Lucene 3.1, ideally based on the new QP framework in contrib. It should share as much code as possible with the current StandardQueryParser implementation for easy maintainability. Wish list (feel free to extend):

1. *Operator precedence*: Support operator precedence for boolean operators
2. *Opaque terms*: Ability to plug in an external parser for certain syntax extensions, e.g. XML query terms
3. *Improved RangeQuery syntax*: Use more intuitive <=, >=, = instead of [] and {}
4. *Support for trie range queries*: See LUCENE-1768
5. *Complex phrases*: See LUCENE-1486
6. *ANY operator*: E.g. (a b c d) ANY 3 should match if 3 of the 4 terms occur in the same document
7. *New syntax for span queries*: I think the surround parser supports this?
8. *Escaped wildcards*: See LUCENE-588
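The ANY operator on the wish list has well-defined semantics: "(a b c d) ANY 3" matches a document containing at least 3 of the 4 terms. In Lucene this corresponds to a BooleanQuery of SHOULD clauses with setMinimumNumberShouldMatch(3); the standalone sketch below (class and method names invented for illustration) just simulates that counting over plain term sets:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simulation of minimum-should-match semantics behind the proposed ANY
// operator: a document matches when at least minMatch of the query terms
// occur in it.
public class AnyOperatorDemo {
    public static boolean matchesAny(List<String> queryTerms,
                                     Set<String> docTerms,
                                     int minMatch) {
        int hits = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                hits++;
            }
        }
        return hits >= minMatch;
    }

    public static void main(String[] args) {
        // "(a b c d) ANY 3" against a document containing a, c, d (and x).
        List<String> query = Arrays.asList("a", "b", "c", "d");
        Set<String> doc = new HashSet<String>(Arrays.asList("a", "c", "d", "x"));
        System.out.println(matchesAny(query, doc, 3)); // 3 of 4 terms present -> prints true
    }
}
```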