Compilation issues in contrib/xml-query-parser ?
Hi,

Am I the only one who has problems compiling when running "ant test", but not when running "ant compile"? Things break in contrib/xml-query-parser/, and all the compilation errors seem to be *Builder classes not seeing *Filter classes:

common.compile-core:
    [mkdir] Created dir: /home/otis/dev/repos/lucene/java/trunk/build/contrib/xml-query-parser/classes/java
    [javac] Compiling 33 source files to /home/otis/dev/repos/lucene/java/trunk/build/contrib/xml-query-parser/classes/java
    [javac] /home/otis/dev/repos/lucene/java/trunk/contrib/xml-query-parser/src/java/org/apache/lucene/xmlparser/builders/BooleanFilterBuilder.java:7: cannot find symbol
    [javac] symbol  : class BooleanFilter
    [javac] location: package org.apache.lucene.search
    [javac] import org.apache.lucene.search.BooleanFilter;
    [javac]        ^
    [javac] /home/otis/dev/repos/lucene/java/trunk/contrib/xml-query-parser/src/java/org/apache/lucene/xmlparser/builders/BooleanFilterBuilder.java:9: cannot find symbol
    [javac] symbol  : class FilterClause
    [javac] location: package org.apache.lucene.search
    [javac] import org.apache.lucene.search.FilterClause;
    ...

Thanks,
Otis

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes
[ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-510.
---------------------------------------
    Resolution: Fixed
    Fix Version/s: 2.4

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters. This issue has been discussed at:
> http://www.mail-archive.com/java-dev@lucene.apache.org/msg01970.html
> We must increment the file format number to indicate this change. At least the format number in the segments file should change.
> I'm targeting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
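The new on-disk shape the issue describes (a VInt length in UTF-8 bytes, followed by the bytes) can be sketched in a few lines. This is a minimal, self-contained illustration, not Lucene's actual IndexOutput code; the class name is made up, though writeVInt follows Lucene's variable-length int encoding (7 bits per byte, high bit set on all but the last byte).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class WriteStringSketch {
    // Lucene-style VInt: 7 bits per byte, low-order groups first,
    // high bit set on every byte except the last.
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    // Per LUCENE-510: write the length in UTF-8 BYTES, then the bytes.
    // (The old format wrote the length in Java chars, in modified UTF-8.)
    static byte[] writeString(String s) {
        byte[] utf8 = s.getBytes(Charset.forName("UTF-8"));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, utf8.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] b = writeString("\u00e9"); // 'é' is 1 Java char but 2 UTF-8 bytes
        System.out.println(b.length);     // 3: one VInt length byte + two payload bytes
        System.out.println(b[0]);         // 2: the stored length counts bytes, not chars
    }
}
```

The 'é' example is exactly the case where the two formats disagree: the old format would have stored a length of 1 (chars), the new one stores 2 (bytes).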
Re: Compilation issues in contrib/xml-query-parser ?
I don't see this problem, on Linux and Mac OS X, JDK 1.5.

Mike

Otis Gospodnetic wrote:
> Am I the only one who has problems compiling when running "ant test", but not when running "ant compile"? Things break in contrib/xml-query-parser/, and all the compilation errors seem to be *Builder classes not seeing *Filter classes:
> common.compile-core:
>     [javac] .../xml-query-parser/src/java/org/apache/lucene/xmlparser/builders/BooleanFilterBuilder.java:7: cannot find symbol
>     [javac] symbol  : class BooleanFilter
>     [javac] location: package org.apache.lucene.search
>     ...
[jira] Updated: (LUCENE-1187) Things to be done now that Filter is independent from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Elschot updated LUCENE-1187:
---------------------------------
    Attachment: Contrib20080326.patch

Contrib20080326.patch supersedes the 20080325 version. Generally the same as yesterday, with some extensions:
- fix a possible synchronisation issue by using a local int[1] array instead of an int attribute on the object,
- return a SortedVIntList when it is definitely smaller than an OpenBitSet; the method doing this is protected,
- all constructors in OpenBitSetDISI now also take an initial size argument (still called maxSize, perhaps better renamed to initialSize).

Both ChainedFilter and BooleanFilter should work normally, except perhaps using less memory because of the SortedVIntList. ChainedFilter still has the 1.1 ASL; it's probably time to upgrade it, but I did not change it in the patch.

> Things to be done now that Filter is independent from BitSet
>
>                 Key: LUCENE-1187
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1187
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Paul Elschot
>            Priority: Minor
>         Attachments: BooleanFilter20080325.patch, ChainedFilterAndCachingFilterTest.patch, Contrib20080325.patch, Contrib20080326.patch, javadocsZero2Match.patch, OpenBitSetDISI-20080322.patch
>
> (Aside: where is the documentation on how to mark up text in jira comments?)
> The following things are left over after LUCENE-584:
> For Lucene 3.0 Filter.bits() will have to be removed.
> There is a CHECKME in IndexSearcher about using ConjunctionScorer to have the boolean behaviour of a Filter.
> I have not looked into Filter caching yet, but I suppose there will be some room for improvement there. Iirc the current core has moved to use OpenBitSetFilter and that is probably what is being cached. In some cases it might be better to cache a SortedVIntList instead.
> Boolean logic on DocIdSetIterator is already available for Scorers (that inherit from DocIdSetIterator) in the search package. This is currently implemented by ConjunctionScorer, DisjunctionSumScorer, ReqOptSumScorer and ReqExclScorer.
> Boolean logic on BitSets is available in contrib/misc and contrib/queries.
> DisjunctionSumScorer calls score() on its subscorers before the score value is actually needed. This could be a reason to introduce a DisjunctionDocIdSetIterator, perhaps as a superclass of DisjunctionSumScorer.
> To fully implement non-scoring queries a TermDocIdSetIterator will be needed, perhaps as a superclass of TermScorer.
> The javadocs in org.apache.lucene.search using matching vs non-zero score: I'll investigate this soon, and provide a patch when necessary.
> An early version of the patches of LUCENE-584 contained a class Matcher, that differs from the current DocIdSet in that Matcher has an explain() method. It remains to be seen whether such a Matcher could be useful between DocIdSet and Scorer.
> The semantics of scorer.skipTo(scorer.doc()) was discussed briefly. This was also discussed at another issue recently, so perhaps it is worthwhile to open a separate issue for this.
> Skipping on a SortedVIntList is done using linear search; this could be improved by adding multilevel skip list info, much like in the Lucene index for documents containing a term.
> One comment by me of 3 Dec 2008: A few complete (test) classes are deprecated; it might be good to add the target release for removal there.
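Paul's "return a SortedVIntList when it is definitely smaller than an OpenBitSet" choice boils down to a size comparison that can be illustrated with back-of-the-envelope arithmetic. The class and method names below are invented for illustration, and the 5-bytes-per-VInt figure is a deliberate worst-case assumption, not Lucene's exact accounting:

```java
// Hedged sketch: when is a sparse VInt doc-id list definitely smaller
// than a bit set over all docs? Names here are illustrative only.
public class DocIdSetSizeHeuristic {
    // An OpenBitSet needs one bit per document, rounded up to 64-bit words.
    static long bitSetBytes(int maxDoc) {
        return ((maxDoc + 63L) / 64L) * 8L;
    }

    // A SortedVIntList stores ascending doc-id deltas as variable-length
    // ints; assume the worst case of 5 bytes per stored doc id.
    static long vIntListWorstCaseBytes(int cardinality) {
        return 5L * cardinality;
    }

    // True when the sparse list is smaller even in the worst case.
    static boolean preferVIntList(int cardinality, int maxDoc) {
        return vIntListWorstCaseBytes(cardinality) < bitSetBytes(maxDoc);
    }

    public static void main(String[] args) {
        // 1,000 hits out of 10,000,000 docs: the sparse list wins easily.
        System.out.println(preferVIntList(1000, 10000000));    // true
        // 5,000,000 hits out of 10,000,000 docs: the bit set is smaller.
        System.out.println(preferVIntList(5000000, 10000000)); // false
    }
}
```

The crossover sits around one set bit per 40 documents under these assumptions, which is why a filter matching few docs benefits from the sparse representation.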
[jira] Updated: (LUCENE-1187) Things to be done now that Filter is independent from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Elschot updated LUCENE-1187:
---------------------------------
    Component/s: Search
                 contrib/*
    Lucene Fields: [New, Patch Available] (was: [New])

> Things to be done now that Filter is independent from BitSet
[jira] Assigned: (LUCENE-1187) Things to be done now that Filter is independent from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch reassigned LUCENE-1187:
-------------------------------------
    Assignee: Michael Busch

> Things to be done now that Filter is independent from BitSet
[jira] Commented: (LUCENE-1187) Things to be done now that Filter is independent from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582414#action_12582414 ]

Michael Busch commented on LUCENE-1187:
---------------------------------------
Thanks for your patches, Paul. I'll be traveling the next days, but I'll try to look at the patches next week.

> Things to be done now that Filter is independent from BitSet
[jira] Commented: (LUCENE-1231) Column-stride fields (aka per-document Payloads)
[ https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582420#action_12582420 ]

Michael McCandless commented on LUCENE-1231:
--------------------------------------------
{quote}
How would this compare to making the storing of position and freq optional for a field? Then one could have an indexed field with a payload or boost but with no freq (or positions, since freq is required for positions). Would that be equivalent?
{quote}

I think this would be very similar, except maybe:

* This proposal would allow for optional non-sparse, fixed-length storage (ie, don't include the docID since all docs have a payload, and the payload is always the same length). EG norms are like this.
* [From the thread linked above] would allow for binary storage of field values. EG for int fields you would store the 4 bytes per value, and populating the cache would be much faster than the FieldCache now (which must re-parse Strings -> ints, and must walk the terms to "reconstruct" the forward index).
* This proposal may allow for updating these values, like we can do with norms today. Maybe this can only work if the field is non-sparse, and perhaps only if you've loaded it into the FieldCache? This would tie into LUCENE-831, so that you could load these fields entirely in RAM, incrementally update them from a reopen, etc.

> Column-stride fields (aka per-document Payloads)
>
>                 Key: LUCENE-1231
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1231
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.4
>
> This new feature has been proposed and discussed here:
> http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results
> Currently it is possible in Lucene to store data as stored fields or as payloads.
> Stored fields provide good performance if you want to load all fields for one document, because this is a sequential I/O operation.
> If you however want to load the data from one field for a large number of documents, then stored fields perform quite badly, because lots of I/O seeks might have to be performed.
> A better way to do this is using payloads. By creating a "special" posting list that has one posting with payload for each document you can "simulate" a column-stride field. The performance is significantly better compared to stored fields, but still not optimal. The reason is that for each document the freq value, which in this particular case is always 1, has to be decoded, and also one position value, which is always 0, has to be loaded.
> As a solution we want to add real column-stride fields to Lucene. A possible format for the new data structure could look like this (CSD stands for column-stride data; once we decide on a final name for this feature we can change this):
> CSDList --> FixedLengthList |
> FixedLengthList --> ^SegSize
> VariableLengthList -->
> Payload --> Byte^PayloadLength
> PayloadLength --> VInt
> SkipList --> see frq.file
> We distinguish here between the fixed-length and the variable-length cases. To allow flexibility, Lucene could automatically pick the "right" data structure. This could work like this: When the DocumentsWriter writes a segment it checks whether all values of a field have the same length. If yes, it stores them as FixedLengthList; if not, then as VariableLengthList. When the SegmentMerger merges two or more segments it checks if all segments have a FixedLengthList with the same length for a column-stride field. If not, it writes a VariableLengthList to the new segment.
> Once this feature is implemented, we should think about making the column-stride fields updateable, similar to the norms.
This will be a very powerful > feature that can for example be used for low-latency tagging of documents. > Other use cases: > - replace norms > - allow to store boost values separately from norms > - as input for the FieldCache, thus providing significantly improved loading > performance (see LUCENE-831) > Things that need to be done here: > - decide for a name for this feature :) - I think "column-stride fields" was > liked better than "per-document payloads" > - Design an API for this feature. We should keep in mind here that these > fields are supposed to be updateable. > - Define datastructures. > I would like to get this feature into 2.4. Feedback about the open questions > is very welcome so that we can finalize the design soon and start > implementing. -- This message is automatically generated by JIRA. - You can reply to this
Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes
> Michael McCandless resolved LUCENE-510.

Congratulations. :)

When I wrote my initial patch, I saw a performance degradation of c. 30% in my indexing benchmarks. Repeated reallocation was presumably one culprit: when the length in Java chars is stored in the index, you only need to allocate once, whereas when reading in UTF-8, you can't know just how much memory you need until the read completes. Furthermore, at write time, you can't look at something composed of 16-bit chars and know what the byte length of its UTF-8 representation will be without pre-scanning.

How did you solve those problems? Are the string diffs and comparisons now performed against raw bytes, so that fewer conversions are needed?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
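Marvin's pre-scanning point is worth making concrete: the UTF-8 byte length of a sequence of Java chars can be computed without allocating anything, at the cost of one extra pass over the chars. A minimal sketch (the class and method names are illustrative, not from Lucene):

```java
public class Utf8Prescan {
    // Count the UTF-8 bytes a char sequence would encode to, without
    // actually encoding it. A surrogate pair becomes one 4-byte sequence;
    // an unpaired surrogate is counted as 3 bytes here (real encoders
    // vary in how they substitute malformed input).
    static int utf8Length(CharSequence s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                       // ASCII
            } else if (c < 0x800) {
                bytes += 2;                       // e.g. Latin-1 supplement, Cyrillic
            } else if (Character.isHighSurrogate(c) && i + 1 < s.length()
                       && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;                       // supplementary code point
                i++;                              // consume the low surrogate too
            } else {
                bytes += 3;                       // rest of the BMP
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("abc"));          // 3
        System.out.println(utf8Length("\u00e9"));       // 2 (é)
        System.out.println(utf8Length("\u4e2d"));       // 3 (CJK char)
        System.out.println(utf8Length("\uD835\uDD0A")); // 4 (U+1D50A)
    }
}
```

Whether this extra pass beats growing a reused byte[] on demand is an empirical question; Mike's reply below suggests reusable buffers were the approach actually taken.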
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582422#action_12582422 ]

Michael McCandless commented on LUCENE-831:
-------------------------------------------
One question here: should we switch to a method call, instead of a straight array, to retrieve a cached value for a doc? If we did that, then MultiSearchers would forward the request to the right IndexReader.

The benefit then is that reopen() of a reader would not have to allocate & bulk copy massive arrays when updating the caches. It would keep the cost of reopen closer to the size of the new segments. And this way the old reader & the new one would not double-allocate the RAM required to hold the common parts of the cache.

We could always still provide a "give me the full array" fallback if people really wanted that (and were willing to accept the cost).

> Complete overhaul of FieldCache API/Implementation
>
>                 Key: LUCENE-831
>                 URL: https://issues.apache.org/jira/browse/LUCENE-831
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Hoss Man
>            Assignee: Michael Busch
>             Fix For: 2.4
>
>         Attachments: fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff, fieldcache-overhaul.diff
>
> Motivation:
> 1) Completely overhaul the API/implementation of "FieldCache" type things...
>    a) eliminate global static map keyed on IndexReader (thus eliminating the synch block between completely independent IndexReaders)
>    b) allow more customization of cache management (ie: use expiration/replacement strategies, disk backed caches, etc)
>    c) allow people to define custom cache data logic (ie: custom parsers, complex datatypes, etc... anything tied to a reader)
>    d) allow people to inspect what's in a cache (list of CacheKeys) for an IndexReader so a new IndexReader can be likewise warmed.
>    e) Lend support for smarter cache management if/when IndexReader.reopen is added (merging of cached data from subReaders).
> 2) Provide backwards compatibility to support the existing FieldCache API with the new implementation, so there is no redundant caching as client code migrates to the new API.
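The "method call instead of a straight array" idea can be sketched as follows. The interface and class names are invented for illustration, not a proposed Lucene API: a multi-segment accessor forwards each lookup to the sub-accessor owning that doc, so a reopen would only need to rebuild accessors for new segments instead of copying one big array.

```java
public class FieldValueAccessors {
    // A method-call cache accessor instead of a bare int[] (illustrative).
    interface IntAccessor {
        int get(int docId);
    }

    // Per-segment values backed by a plain array, as FieldCache does today.
    static class ArrayIntAccessor implements IntAccessor {
        private final int[] values;
        ArrayIntAccessor(int[] values) { this.values = values; }
        public int get(int docId) { return values[docId]; }
    }

    // Forwards to the sub-accessor owning docId; on reopen, only accessors
    // for new segments would need to be rebuilt, the rest are shared.
    static class MultiIntAccessor implements IntAccessor {
        private final IntAccessor[] subs;  // one accessor per segment
        private final int[] starts;        // doc base of each segment
        MultiIntAccessor(IntAccessor[] subs, int[] starts) {
            this.subs = subs;
            this.starts = starts;
        }
        public int get(int docId) {
            int lo = 0, hi = subs.length - 1;  // binary-search the segment
            while (lo < hi) {
                int mid = (lo + hi + 1) >>> 1;
                if (starts[mid] <= docId) lo = mid; else hi = mid - 1;
            }
            return subs[lo].get(docId - starts[lo]);
        }
    }

    public static void main(String[] args) {
        IntAccessor multi = new MultiIntAccessor(
            new IntAccessor[] {
                new ArrayIntAccessor(new int[] {10, 11, 12}),  // segment 0
                new ArrayIntAccessor(new int[] {20, 21})       // segment 1
            },
            new int[] {0, 3});
        System.out.println(multi.get(2)); // 12 (segment 0, local doc 2)
        System.out.println(multi.get(4)); // 21 (segment 1, local doc 1)
    }
}
```

The trade-off Mike raises is visible here: every get() pays an interface dispatch plus a segment search, versus a single array index in the flat-array design.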
Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes
Marvin Humphrey wrote:
> > Michael McCandless resolved LUCENE-510.
> Congratulations. :)

Thanks. I didn't quite realize what I was getting myself into when I said "yes" on that issue!

> When I wrote my initial patch, I saw a performance degradation of c. 30% in my indexing benchmarks.

I think it was 20%.

> Repeated reallocation was presumably one culprit: when length in Java chars is stored in the index, you only need to allocate once, whereas when reading in UTF-8, you can't know just how much memory you need until the read completes. Furthermore, at write-time, you can't look at something composed of 16-bit chars and know what the byte-length of its UTF-8 representation will be without pre-scanning.

Right, not doing allocations was pretty much it (the getBytes method of String was most of the slowdown, I think). I was also able to eliminate another per-term scan we were doing in DocumentsWriter and fold it into the conversion.

I ended up creating custom conversion methods (UTF8toUTF16 and vice versa) to do this conversion into a re-used byte[] or char[], which grow as needed; then I just bulk-write the bytes. I think this is not much slower than before (modified UTF-8), since that also had to go character by character with ifs inside the inner loop.

I'm less happy with the 11% slowdown on TermEnum, and that's even with the optimization to incrementally decode only the "new" UTF-8 bytes as we are reading the changed suffix of each term, reusing the already-decoded UTF-16 chars from the previous term. This will slow down populating a FieldCache, which is already slow. But LUCENE-831 and LUCENE-1231 should fix that.

> Are the string diffs and comparisons now performed against raw bytes, so that fewer conversions are needed?

Alas, not yet: Lucene still uses UTF-16 Java chars internally. The conversion to UTF-8 happens "at the last minute" when writing, and "immediately" when reading.
I started exploring keeping UTF-8 bytes further in, but it quickly got messy because it would require changing how the term infos are sorted to be unicode code point order. Comparing bytes in UTF-8 is the same as comparing unicode code points, which is nice. But comparing UTF-16 values is almost but not quite the same. So suddenly everywhere where a string comparison takes place I had to assess whether that comparison should be by unicode code point, and call our own method for doing so. It quickly became a "big" project so I ran back to sorting by UTF-16 value. Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
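The "almost but not quite the same" above comes from the surrogate range: in UTF-16 code unit order, a BMP character above U+DFFF compares greater than any surrogate, but in code point (and therefore UTF-8 byte) order it compares less than any supplementary character. A self-contained demonstration of the two orders disagreeing:

```java
public class Utf16VsCodePoint {
    public static void main(String[] args) {
        String bmp  = "\uFB01";       // U+FB01, a BMP char above the surrogate range
        String supp = "\uD800\uDC00"; // U+10000, the smallest supplementary code point

        // UTF-16 code unit order (String.compareTo): 0xFB01 > 0xD800,
        // so the BMP char sorts AFTER the supplementary char.
        System.out.println(bmp.compareTo(supp) > 0);                  // true

        // Code point order (same as comparing UTF-8 bytes): 0xFB01 < 0x10000,
        // so the BMP char sorts BEFORE the supplementary char.
        System.out.println(bmp.codePointAt(0) < supp.codePointAt(0)); // true
    }
}
```

So the two orders agree everywhere except when a string containing U+E000 through U+FFFF is compared against one containing a supplementary character at the same position, which is exactly why switching the term dictionary to byte comparisons would have forced a sort-order change.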
Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes
On Wed, Mar 26, 2008 at 5:22 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > > Are the string diffs and comparisons now performed against raw > > bytes, so that fewer conversions are needed? > > Alas, not yet: Lucene still uses UTF16 java chars internally. The > conversion to UTF-8 happens "at the last minute" when writing, and > "immediately" when reading. > > I started exploring keeping UTF-8 bytes further in, but it quickly > got messy because it would require changing how the term infos are > sorted to be unicode code point order. Comparing bytes in UTF-8 is > the same as comparing unicode code points, which is nice. But > comparing UTF-16 values is almost but not quite the same. So > suddenly everywhere where a string comparison takes place I had to > assess whether that comparison should be by unicode code point, and > call our own method for doing so. It quickly became a "big" project > so I ran back to sorting by UTF-16 value. Hmmm, can't we always do it by unicode code point? When do we need UTF-16 order? -Yonik
[jira] Commented: (LUCENE-1231) Column-stride fields (aka per-document Payloads)
[ https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582442#action_12582442 ] Doug Cutting commented on LUCENE-1231: -- So there are a number of features these fields would have that differ from other fields: - no freq - no positions - non-sparse representation - binary values (is this different from payloads?) - updateable My question is whether it is best to bundle these together as a new kind of field, or add these as optional features of ordinary fields, or some combination. There are a certain bundles that may work well together: e.g., a dense array of fixed-size, updateable binary values w/o freqs or positions. And not all combinations may be sensible or easy to implement. But most of these would also be useful ala carte too, e.g., no-freqs, no-positions and (perhaps) updateable. BTW, setTermPositions(TermPositions) and setTermDocs(TermDocs) might be a reasonable API for updating sparse fields. > Column-stride fields (aka per-document Payloads) > > > Key: LUCENE-1231 > URL: https://issues.apache.org/jira/browse/LUCENE-1231 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.4 > > > This new feature has been proposed and discussed here: > http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results > Currently it is possible in Lucene to store data as stored fields or as > payloads. > Stored fields provide good performance if you want to load all fields for one > document, because this is an sequential I/O operation. > If you however want to load the data from one field for a large number of > documents, then stored fields perform quite badly, because lot's of I/O seeks > might have to be performed. > A better way to do this is using payloads. 
By creating a "special" posting > list > that has one posting with payload for each document you can "simulate" a > column- > stride field. The performance is significantly better compared to stored > fields, > however still not optimal. The reason is that for each document the freq > value, > which is in this particular case always 1, has to be decoded, also one > position > value, which is always 0, has to be loaded. > As a solution we want to add real column-stride fields to Lucene. A possible > format for the new data structure could look like this (CSD stands for column- > stride data, once we decide for a final name for this feature we can change > this): > CSDList --> FixedLengthList | > FixedLengthList --> ^SegSize > VariableLengthList --> > Payload --> Byte^PayloadLength > PayloadLength --> VInt > SkipList --> see frq.file > We distinguish here between the fixed length and the variable length cases. To > allow flexibility, Lucene could automatically pick the "right" data > structure. > This could work like this: When the DocumentsWriter writes a segment it > checks > whether all values of a field have the same length. If yes, it stores them as > FixedLengthList, if not, then as VariableLengthList. When the SegmentMerger > merges two or more segments it checks if all segments have a FixedLengthList > with the same length for a column-stride field. If not, it writes a > VariableLengthList to the new segment. > Once this feature is implemented, we should think about making the column- > stride fields updateable, similar to the norms. This will be a very powerful > feature that can for example be used for low-latency tagging of documents. 
> Other use cases: > - replace norms > - allow to store boost values separately from norms > - as input for the FieldCache, thus providing significantly improved loading > performance (see LUCENE-831) > Things that need to be done here: > - decide for a name for this feature :) - I think "column-stride fields" was > liked better than "per-document payloads" > - Design an API for this feature. We should keep in mind here that these > fields are supposed to be updateable. > - Define datastructures. > I would like to get this feature into 2.4. Feedback about the open questions > is very welcome so that we can finalize the design soon and start > implementing. -- This message is automatically generated by JIRA.
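To make the FixedLengthList idea above concrete, here is a minimal sketch of a dense, fixed-width per-document column. This is not the proposed file format; the class and method names are invented for illustration. The point is that a dense layout gives O(1) value lookup with no per-posting freq or position decoding:

```java
import java.nio.ByteBuffer;

// Sketch of a dense column: segSize * valueLength bytes, ordered by document number.
public class FixedLengthColumn {
    private final byte[] data;
    private final int valueLength;

    FixedLengthColumn(byte[] data, int valueLength) {
        this.data = data;
        this.valueLength = valueLength;
    }

    // O(1) lookup: the value for doc N starts at byte N * valueLength.
    byte[] get(int doc) {
        byte[] out = new byte[valueLength];
        System.arraycopy(data, doc * valueLength, out, 0, valueLength);
        return out;
    }

    public static void main(String[] args) {
        // Three documents, each with one 4-byte value, packed densely.
        byte[] packed = ByteBuffer.allocate(12).putInt(7).putInt(8).putInt(9).array();
        FixedLengthColumn col = new FixedLengthColumn(packed, 4);
        System.out.println(ByteBuffer.wrap(col.get(1)).getInt());  // 8
    }
}
```

The variable-length case would need an offsets structure (the issue's SkipList/PayloadLength) in front of the data, which is why the proposal distinguishes the two layouts.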
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582443#action_12582443 ] Michael Busch commented on LUCENE-831: -- {quote} The benefit then is that reopen() of a reader would not have to allocate & bulk copy massive arrays when updating the caches. It would keep the cost of reopen closer to the size of the new segments. {quote} I agree, Mike. Currently during reopen() the MultiSegmentReader allocates a new norms array of size maxDoc(), which is, as you said, inefficient if only some (maybe even small) segments changed. The method call might be a little slower than the array lookup, but I doubt that this would be very significant. We can make this change for the norms and run performance tests to measure the slowdown. > Complete overhaul of FieldCache API/Implementation > -- > > Key: LUCENE-831 > URL: https://issues.apache.org/jira/browse/LUCENE-831 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Hoss Man >Assignee: Michael Busch > Fix For: 2.4 > > Attachments: fieldcache-overhaul.032208.diff, > fieldcache-overhaul.diff, fieldcache-overhaul.diff > > > Motivation: > 1) Completely overhaul the API/implementation of "FieldCache" type things... > a) eliminate global static map keyed on IndexReader (thus > eliminating synch block between completely independent IndexReaders) > b) allow more customization of cache management (ie: use > expiration/replacement strategies, disk backed caches, etc) > c) allow people to define custom cache data logic (ie: custom > parsers, complex datatypes, etc... anything tied to a reader) > d) allow people to inspect what's in a cache (list of CacheKeys) for > an IndexReader so a new IndexReader can be likewise warmed. > e) Lend support for smarter cache management if/when > IndexReader.reopen is added (merging of cached data from subReaders). 
> 2) Provide backwards compatibility to support existing FieldCache API with > the new implementation, so there is no redundant caching as client code > migrates to the new API. -- This message is automatically generated by JIRA.
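The quoted point above (a method call instead of a raw array, so reopen() need not bulk-copy a maxDoc()-sized array) can be sketched as follows. The interface and method names are invented for illustration and are not Lucene's actual API:

```java
import java.util.Arrays;

// Illustrative accessor interface: one virtual call per lookup instead of a raw array.
public class PerSegmentValues {
    interface DocValues {
        int intVal(int doc);
    }

    // Values for a single segment, backed by a plain array.
    static DocValues forSegment(int[] values) {
        return doc -> values[doc];
    }

    // Multi-segment view: a binary search plus a delegated call per lookup, but
    // reopening with one changed segment only swaps that segment's entry -- no
    // maxDoc()-sized array is reallocated or copied.
    static DocValues forSegments(DocValues[] subs, int[] starts) {
        return doc -> {
            int i = Arrays.binarySearch(starts, doc);
            if (i < 0) i = -i - 2;  // doc falls inside the segment starting before it
            return subs[i].intVal(doc - starts[i]);
        };
    }

    public static void main(String[] args) {
        DocValues seg1 = forSegment(new int[]{10, 11});
        DocValues seg2 = forSegment(new int[]{20});
        DocValues multi = forSegments(new DocValues[]{seg1, seg2}, new int[]{0, 2});
        System.out.println(multi.intVal(2));  // 20: doc 2 is doc 0 of the second segment
    }
}
```

The cost Michael mentions is the virtual call (plus, here, a binary search) per access, which is exactly what the proposed norms performance test would measure.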
Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes
Yonik Seeley <[EMAIL PROTECTED]> wrote: > Hmmm, can't we always do it by unicode code point? > When do we need UTF-16 order? In theory, we can. I think the sort order doesn't matter much, as long as everyone (writers & readers) agree what it is. I think unicode code point order is more "standards compliant" too. A big benefit is then we could leave things (eg TermBuffer and maybe eventually Term, FieldCache) as UTF8 bytes and save on the conversion cost when reading. But I don't think Java provides a way to do this comparison? However it's not hard to implement your own: http://www.icu-project.org/docs/papers/utf16_code_point_order.html But then I worried about how much slower that code is than String.compareTo, and I found a lot of places where an innocent compareTo or < or > needed to be changed to this method call. Field name comparisons would have to be fixed too. Then for backwards compatibility all of these places that do comparisons would have to fall back to the Java way when interacting with an older segment. I think we can still explore this? It just seemed way too big to glom into the already-big changes in LUCENE-510. Mike
Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes
On Wed, Mar 26, 2008 at 6:06 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > Yonik Seeley <[EMAIL PROTECTED]> wrote: > > > Hmmm, can't we always do it by unicode code point? > > When do we need UTF-16 order? > > In theory, we can. I think the sort order doesn't matter much, as > long as everyone (writers & readers) agree what it is. I think > unicode code point order is more "standards compliant" too. > > A big benefit is then we could leave things (eg TermBuffer and maybe > eventually Term, FieldCache) as UTF8 bytes and save on the conversion > cost when reading. > > But I don't think Java provides a way to do this comparison? However > it's not hard to implement your own: > > http://www.icu-project.org/docs/papers/utf16_code_point_order.html Not sure I follow... you just do a byte-by-byte comparison right? For ASCII, this should be slightly faster (same number of comparisons, less memory space and hence less cache space overall). > But then I worried about how much slower that code is than > String.compareTo, and, I found alot of places where innocent compareTo > or < or > needed to be changed to this method call. Field name > comparisons would have to be fixed too. Then for backwards > compatibility all of these places that do comparisons would have to > fallback to the Java way when interacting with an older segment. Oh... older segments. Yeah, I was speaking "theoretically". > I think we can still explore this? It just seemed way too big to > glomm into the already-big changes in LUCENE-510. Yeah, I was thinking of some of this more along the lines of Lucene 3. A term could contain a byte array instead of a String. A String constructor would convert to UTF8 and then do lookups in the index (simple byte comparisons, no charset encoding). A byte constructor for Term would also be allowed. Things like TermEnumerators would keep everything in bytes, the tii would be in bytes, etc. One could also think about ways to directly index bytes too. Is it all worth it? 
I really don't know. -Yonik
Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes
Yonik Seeley wrote: On Wed, Mar 26, 2008 at 6:06 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: Yonik Seeley <[EMAIL PROTECTED]> wrote: Hmmm, can't we always do it by unicode code point? When do we need UTF-16 order? In theory, we can. I think the sort order doesn't matter much, as long as everyone (writers & readers) agree what it is. I think unicode code point order is more "standards compliant" too. A big benefit is then we could leave things (eg TermBuffer and maybe eventually Term, FieldCache) as UTF8 bytes and save on the conversion cost when reading. But I don't think Java provides a way to do this comparison? However it's not hard to implement your own: http://www.icu-project.org/docs/papers/utf16_code_point_order.html Not sure I follow... you just do a byte-by-byte comparison right? For ASCII, this should be slightly faster (same number of comparisons, less memory space and hence less cache space overall). Sorry, you're right: if you're working with byte[] at the time, a byte by byte comparison of UTF8 gives you the same order as unicode code point. It's when you need to compare a String or char[] to one another, or to a UTF8 byte[], that you need that code. But then I worried about how much slower that code is than String.compareTo, and, I found alot of places where innocent compareTo or < or > needed to be changed to this method call. Field name comparisons would have to be fixed too. Then for backwards compatibility all of these places that do comparisons would have to fallback to the Java way when interacting with an older segment. Oh... older segments. Yeah, I was speaking "theoretically". Yeah. I think we can still explore this? It just seemed way too big to glomm into the already-big changes in LUCENE-510. Yeah, I was thinking of some of this more along the lines of Lucene 3. A term could contain a byte array instead of a String. 
A String constructor would convert to UTF8 and then do lookups in the index (simple byte comparisons, no charset encoding). A byte constructor for Term would also be allowed. Things like TermEnumerators would keep everything in bytes, the tii would be in bytes, etc. Yup. One could also think about ways to directly index bytes too. Right, DocumentsWriter could hold its terms in byte[] and save time/space when terms are ASCII. Is it all worth it? I really don't know. Right, that's where I started to wonder. It felt very much like I was "going against the grain of Java" as the changes started to pile up ... Mike
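The comparison subtlety resolved in this thread is easy to demonstrate: an unsigned byte-by-byte comparison of UTF-8 matches Unicode code point order, while String.compareTo (UTF-16 code unit order) disagrees once supplementary characters are involved. A self-contained sketch, not Lucene code:

```java
import java.nio.charset.StandardCharsets;

public class CodePointOrder {
    // Unsigned byte-by-byte comparison of the UTF-8 encodings; this is
    // equivalent to comparing by Unicode code point.
    static int compareUtf8(String a, String b) {
        byte[] ba = a.getBytes(StandardCharsets.UTF_8);
        byte[] bb = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(ba.length, bb.length);
        for (int i = 0; i < n; i++) {
            int d = (ba[i] & 0xFF) - (bb[i] & 0xFF);
            if (d != 0) return d;
        }
        return ba.length - bb.length;
    }

    public static void main(String[] args) {
        String bmp = "\uFF01";                                 // U+FF01, in the BMP
        String supp = new String(Character.toChars(0x10000));  // first supplementary char
        // UTF-16 order: the supplementary char sorts first, because its lead
        // surrogate (0xD800) is below 0xFF01 as a 16-bit code unit.
        System.out.println(bmp.compareTo(supp) > 0);    // true
        // Code point order: U+FF01 < U+10000, so the BMP char sorts first.
        System.out.println(compareUtf8(bmp, supp) < 0); // true
    }
}
```

This is also why, as Yonik notes, pure byte[] comparisons are cheap; the ICU-style comparison is only needed when a String or char[] must be ordered consistently with UTF-8 bytes.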
[jira] Commented: (LUCENE-1231) Column-stride fields (aka per-document Payloads)
[ https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582464#action_12582464 ] Michael McCandless commented on LUCENE-1231: Sorry, you're right: the payload is the binary data. {quote} So there are a number of features these fields would have that differ from other fields: {quote} Maybe add "stored in its own file" or some such to that list. Ie, to efficiently update field X I would think you want it stored in its own file. We would then fully write a new generation of that file whenever it had changes. I agree it would be great to implement this as "flexible indexing", such that these are simply à la carte options on how the field is indexed, rather than making a new specialized kind of field that just does one of these "combinations". But I haven't wrapped my brain around what all this will entail... it's a biggie! {quote} BTW, setTermPositions(TermPositions) and setTermDocs(TermDocs) might be a reasonable API for updating sparse fields. {quote} I like that! > Column-stride fields (aka per-document Payloads) > > > Key: LUCENE-1231 > URL: https://issues.apache.org/jira/browse/LUCENE-1231 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.4 > > -- This message is automatically generated by JIRA.
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582471#action_12582471 ] Mark Miller commented on LUCENE-831: >If you're going to incrementally update a FieldCache of a MultiReader, it's the same issue... can't merge the ordinals without the original (String) values. That is a great point. >should we switch to a method call, instead of a straight array, to retrieve a cached value for a doc? Sounds like a great idea to me. It solves the StringIndex merge and eliminates all merge costs, at the price of a method call per access. > Complete overhaul of FieldCache API/Implementation > -- > > Key: LUCENE-831 > URL: https://issues.apache.org/jira/browse/LUCENE-831 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Hoss Man >Assignee: Michael Busch > Fix For: 2.4 > > Attachments: fieldcache-overhaul.032208.diff, > fieldcache-overhaul.diff, fieldcache-overhaul.diff > > -- This message is automatically generated by JIRA.
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582480#action_12582480 ] Mark Miller commented on LUCENE-831: Hmm... how do we avoid having to pull the cached field values through a sync on every call? The field data has to be cached... and the method to return the single cached field value has to be multi-threaded... > Complete overhaul of FieldCache API/Implementation > -- > > Key: LUCENE-831 > URL: https://issues.apache.org/jira/browse/LUCENE-831 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Hoss Man >Assignee: Michael Busch > Fix For: 2.4 > > Attachments: fieldcache-overhaul.032208.diff, > fieldcache-overhaul.diff, fieldcache-overhaul.diff > > -- This message is automatically generated by JIRA.
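One common answer to the per-call sync worry above is to pay the synchronization only when the per-field accessor is first created, and have the accessor wrap a safely published, immutable array so that each value read is a plain, lock-free method call. A hedged sketch with invented names and a stand-in loader (not the proposed FieldCache API):

```java
import java.util.concurrent.ConcurrentHashMap;

public class LockFreeReads {
    interface IntValues {
        int get(int doc);
    }

    private final ConcurrentHashMap<String, IntValues> cache = new ConcurrentHashMap<>();

    // computeIfAbsent does the synchronized work at most once per field; after
    // that, each lookup returns the already-built accessor without contention.
    IntValues getInts(String field) {
        return cache.computeIfAbsent(field, f -> {
            final int[] values = loadFieldValues(f);  // expensive load, done once
            return doc -> values[doc];                // plain array read, no lock
        });
    }

    // Stand-in for uninverting a field from the index.
    int[] loadFieldValues(String field) {
        return new int[]{1, 2, 3};
    }

    public static void main(String[] args) {
        LockFreeReads reader = new LockFreeReads();
        System.out.println(reader.getInts("price").get(2));  // 3
    }
}
```

The final array reference inside the accessor gives safe publication; an update would install a whole new accessor rather than mutate the array in place.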
[jira] Updated: (LUCENE-1245) MultiFieldQueryParser is not friendly for overriding getFieldQuery(String,String,int)
[ https://issues.apache.org/jira/browse/LUCENE-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Trejkaz updated LUCENE-1245: Lucene Fields: [New, Patch Available] (was: [New]) Summary: MultiFieldQueryParser is not friendly for overriding getFieldQuery(String,String,int) (was: MultiFieldQueryParser is not friendly for overriding) (Updating title to be more specific about what wasn't friendly.) > MultiFieldQueryParser is not friendly for overriding > getFieldQuery(String,String,int) > - > > Key: LUCENE-1245 > URL: https://issues.apache.org/jira/browse/LUCENE-1245 > Project: Lucene - Java > Issue Type: Improvement > Components: QueryParser >Affects Versions: 2.3.2 >Reporter: Trejkaz > > LUCENE-1213 fixed an issue in MultiFieldQueryParser where the slop parameter > wasn't being properly applied. The problem is that the fix which eventually got > committed calls super.getFieldQuery(String,String), bypassing any > possibility of customising the query behaviour. > This should be relatively simple to fix by modifying > getFieldQuery(String,String,int) to, if field is null, recursively call > getFieldQuery(String,String,int) instead of setting the slop itself. This > gives subclasses which override either getFieldQuery method a chance to do > something different. -- This message is automatically generated by JIRA.
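The delegation pattern the issue proposes can be modeled without any Lucene dependency; all class names below are illustrative stand-ins, and "queries" are plain strings. The key point is that the multi-field (null field) path recurses through the overridable three-argument getFieldQuery instead of jumping straight to super, so a subclass override of either method still takes effect:

```java
import java.util.ArrayList;
import java.util.List;

public class OverrideFriendly {
    static class Parser {
        protected String getFieldQuery(String field, String text) {
            return field + ":" + text;
        }
        protected String getFieldQuery(String field, String text, int slop) {
            return getFieldQuery(field, text) + "~" + slop;  // apply slop to base query
        }
    }

    static class MultiFieldParser extends Parser {
        private final String[] fields;
        MultiFieldParser(String[] fields) { this.fields = fields; }

        @Override
        protected String getFieldQuery(String field, String text, int slop) {
            if (field == null) {
                // Fan out per field via the overridable 3-arg method, not super's 2-arg one.
                List<String> clauses = new ArrayList<>();
                for (String f : fields) clauses.add(getFieldQuery(f, text, slop));
                return String.join(" OR ", clauses);
            }
            return super.getFieldQuery(field, text, slop);
        }
    }

    public static void main(String[] args) {
        // A subclass override is now visible even on the multi-field path.
        Parser p = new MultiFieldParser(new String[]{"a", "b"}) {
            @Override
            protected String getFieldQuery(String field, String text) {
                return super.getFieldQuery(field, "z" + text);  // customize per-field text
            }
        };
        System.out.println(p.getFieldQuery(null, "abc", 1));  // a:zabc~1 OR b:zabc~1
    }
}
```

This mirrors the shape of the test case that appears later in the thread, where mangling the query text must yield the customized terms for every field.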
[jira] Updated: (LUCENE-1245) MultiFieldQueryParser is not friendly for overriding getFieldQuery(String,String,int)
[ https://issues.apache.org/jira/browse/LUCENE-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Trejkaz updated LUCENE-1245: Attachment: multifield.patch The fix makes getFieldQuery(String,String) and getFieldQuery(String,String,int) work more or less the same. Neither calls methods on super, so overriding the methods will work (and does, although I have no unit test for this yet). Common boosting logic is extracted to an applyBoost method. I have also removed the check for the clauses being empty, as getBooleanQuery appears to do that already. > MultiFieldQueryParser is not friendly for overriding > getFieldQuery(String,String,int) > - > > Key: LUCENE-1245 > URL: https://issues.apache.org/jira/browse/LUCENE-1245 > Project: Lucene - Java > Issue Type: Improvement > Components: QueryParser >Affects Versions: 2.3.2 >Reporter: Trejkaz > Attachments: multifield.patch > > -- This message is automatically generated by JIRA.
[jira] Commented: (LUCENE-1245) MultiFieldQueryParser is not friendly for overriding getFieldQuery(String,String,int)
[ https://issues.apache.org/jira/browse/LUCENE-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582490#action_12582490 ] Trejkaz commented on LUCENE-1245: - Here's an example illustrating the way we were using it, although instead of changing the query text we're actually returning a different query class -- that class isn't in Lucene Core, and it's also easier to build up an expected query if it's just a TermQuery.

public void testOverrideGetFieldQuery() throws Exception {
    String[] fields = { "a", "b" };
    QueryParser parser = new MultiFieldQueryParser(fields, new StandardAnalyzer()) {
        protected Query getFieldQuery(String field, String queryText, int slop) throws ParseException {
            if (field != null && slop == 1) {
                field = "z" + field;
            }
            return super.getFieldQuery(field, queryText, slop);
        }
    };
    BooleanQuery expected = new BooleanQuery();
    expected.add(new TermQuery(new Term("a", "zabc")), BooleanClause.Occur.SHOULD);
    expected.add(new TermQuery(new Term("b", "zabc")), BooleanClause.Occur.SHOULD);
    assertEquals("Expected a mangled query", expected, parser.parse("\"abc\"~1"));
}

> MultiFieldQueryParser is not friendly for overriding > getFieldQuery(String,String,int) > - > > Key: LUCENE-1245 > URL: https://issues.apache.org/jira/browse/LUCENE-1245 > Project: Lucene - Java > Issue Type: Improvement > Components: QueryParser >Affects Versions: 2.3.2 >Reporter: Trejkaz > Attachments: multifield.patch > > -- This message is automatically generated by JIRA.
[jira] Issue Comment Edited: (LUCENE-1245) MultiFieldQueryParser is not friendly for overriding getFieldQuery(String,String,int)
[ https://issues.apache.org/jira/browse/LUCENE-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582490#action_12582490 ] trejkaz edited comment on LUCENE-1245 at 3/26/08 5:13 PM: -- Here's an example illustrating the way we were using it, although instead of changing the query text we're actually returning a different query class -- that class isn't in Lucene Core, and it's also easier to build up an expected query if it's just a TermQuery.
{noformat}
public void testOverrideGetFieldQuery() throws Exception {
    String[] fields = { "a", "b" };
    QueryParser parser = new MultiFieldQueryParser(fields, new StandardAnalyzer()) {
        protected Query getFieldQuery(String field, String queryText, int slop) throws ParseException {
            if (field != null && slop == 1) {
                field = "z" + field;
            }
            return super.getFieldQuery(field, queryText, slop);
        }
    };
    BooleanQuery expected = new BooleanQuery();
    expected.add(new TermQuery(new Term("a", "zabc")), BooleanClause.Occur.SHOULD);
    expected.add(new TermQuery(new Term("b", "zabc")), BooleanClause.Occur.SHOULD);
    assertEquals("Expected a mangled query", expected, parser.parse("\"abc\"~1"));
}
{noformat}
> MultiFieldQueryParser is not friendly for overriding > getFieldQuery(String,String,int) > - > > Key: LUCENE-1245 > URL: https://issues.apache.org/jira/browse/LUCENE-1245 > Project: Lucene - Java > Issue Type: Improvement > Components: QueryParser >Affects Versions: 2.3.2 >Reporter: Trejkaz > Attachments: multifield.patch > > -- This message is automatically generated by JIRA.
[jira] Issue Comment Edited: (LUCENE-1245) MultiFieldQueryParser is not friendly for overriding getFieldQuery(String,String,int)
[ https://issues.apache.org/jira/browse/LUCENE-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582490#action_12582490 ] trejkaz edited comment on LUCENE-1245 at 3/26/08 5:32 PM: -- Here's an example illustrating the way we were using it, although instead of changing the query text we're actually returning a different query class -- that class isn't in Lucene Core, and it's also easier to build up an expected query if it's just a TermQuery.
{noformat}
public void testOverrideGetFieldQuery() throws Exception {
    String[] fields = { "a", "b" };
    QueryParser parser = new MultiFieldQueryParser(fields, new StandardAnalyzer()) {
        protected Query getFieldQuery(String field, String queryText, int slop) throws ParseException {
            if (field != null && slop == 1) {
                queryText = "z" + queryText;
            }
            return super.getFieldQuery(field, queryText, slop);
        }
    };
    BooleanQuery expected = new BooleanQuery();
    expected.add(new TermQuery(new Term("a", "zabc")), BooleanClause.Occur.SHOULD);
    expected.add(new TermQuery(new Term("b", "zabc")), BooleanClause.Occur.SHOULD);
    assertEquals("Expected a mangled query", expected, parser.parse("\"abc\"~1"));
}
{noformat}
was (Author: trejkaz): Here's an example illustrating the way we were using it, although instead of changing the query text we're actually returning a different query class -- that class isn't in Lucene Core, and it's also easier to build up an expected query if it's just a TermQuery.
{noformat}
public void testOverrideGetFieldQuery() throws Exception {
    String[] fields = { "a", "b" };
    QueryParser parser = new MultiFieldQueryParser(fields, new StandardAnalyzer()) {
        protected Query getFieldQuery(String field, String queryText, int slop) throws ParseException {
            if (field != null && slop == 1) {
                field = "z" + field;
            }
            return super.getFieldQuery(field, queryText, slop);
        }
    };
    BooleanQuery expected = new BooleanQuery();
    expected.add(new TermQuery(new Term("a", "zabc")), BooleanClause.Occur.SHOULD);
    expected.add(new TermQuery(new Term("b", "zabc")), BooleanClause.Occur.SHOULD);
    assertEquals("Expected a mangled query", expected, parser.parse("\"abc\"~1"));
}
{noformat}
> MultiFieldQueryParser is not friendly for overriding > getFieldQuery(String,String,int) > - > > Key: LUCENE-1245 > URL: https://issues.apache.org/jira/browse/LUCENE-1245 > Project: Lucene - Java > Issue Type: Improvement > Components: QueryParser >Affects Versions: 2.3.2 >Reporter: Trejkaz > Attachments: multifield.patch > > -- This message is automatically generated by JIRA.
Hudson build is back to normal: Lucene-trunk #417
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/417/changes