Compilation issues in contrib/xml-query-parser ?
Hi,

Am I the only one who has problems compiling when running "ant test", but not when running "ant compile"? Things break in contrib/xml-query-parser/, and all the compilation errors seem to be *Builder classes not seeing *Filter classes:

common.compile-core:
    [mkdir] Created dir: /home/otis/dev/repos/lucene/java/trunk/build/contrib/xml-query-parser/classes/java
    [javac] Compiling 33 source files to /home/otis/dev/repos/lucene/java/trunk/build/contrib/xml-query-parser/classes/java
    [javac] /home/otis/dev/repos/lucene/java/trunk/contrib/xml-query-parser/src/java/org/apache/lucene/xmlparser/builders/BooleanFilterBuilder.java:7: cannot find symbol
    [javac] symbol  : class BooleanFilter
    [javac] location: package org.apache.lucene.search
    [javac] import org.apache.lucene.search.BooleanFilter;
    [javac]        ^
    [javac] /home/otis/dev/repos/lucene/java/trunk/contrib/xml-query-parser/src/java/org/apache/lucene/xmlparser/builders/BooleanFilterBuilder.java:9: cannot find symbol
    [javac] symbol  : class FilterClause
    [javac] location: package org.apache.lucene.search
    [javac] import org.apache.lucene.search.FilterClause;
    ...

Thanks,
Otis

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes
[ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-510.
---------------------------------------
    Resolution: Fixed
    Fix Version/s: 2.4

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters. This issue has been discussed at:
> http://www.mail-archive.com/java-dev@lucene.apache.org/msg01970.html
> We must increment the file format number to indicate this change. At least the format number in the segments file should change.
> I'm targeting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
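The new on-disk shape the issue describes (a VInt length in UTF-8 bytes, followed by the bytes) can be sketched in a few lines. This is a minimal, self-contained illustration, not Lucene's actual IndexOutput code; the class name is made up, though writeVInt follows Lucene's variable-length int encoding (7 bits per byte, high bit set on all but the last byte).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class WriteStringSketch {
    // Lucene-style VInt: 7 bits per byte, low-order groups first,
    // high bit set on every byte except the last.
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    // Per LUCENE-510: write the length in UTF-8 BYTES, then the bytes.
    // (The old format wrote the length in Java chars, in modified UTF-8.)
    static byte[] writeString(String s) {
        byte[] utf8 = s.getBytes(Charset.forName("UTF-8"));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, utf8.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] b = writeString("\u00e9"); // 'é' is 1 Java char but 2 UTF-8 bytes
        System.out.println(b.length);     // 3: one VInt length byte + two payload bytes
        System.out.println(b[0]);         // 2: the stored length counts bytes, not chars
    }
}
```

The 'é' example is exactly the case where the two formats disagree: the old format would have stored a length of 1 (chars), the new one stores 2 (bytes).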
Re: Compilation issues in contrib/xml-query-parser ?
I don't see this problem, on Linux and Mac OS X, JDK 1.5.

Mike

Otis Gospodnetic wrote:
> Am I the only one who has problems compiling when running "ant test", but not when running "ant compile"? Things break in contrib/xml-query-parser/, and all the compilation errors seem to be *Builder classes not seeing *Filter classes:
> common.compile-core:
>     [javac] .../xml-query-parser/src/java/org/apache/lucene/xmlparser/builders/BooleanFilterBuilder.java:7: cannot find symbol
>     [javac] symbol  : class BooleanFilter
>     [javac] location: package org.apache.lucene.search
>     ...
[jira] Updated: (LUCENE-1187) Things to be done now that Filter is independent from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Elschot updated LUCENE-1187:
---------------------------------
    Attachment: Contrib20080326.patch

Contrib20080326.patch supersedes the 20080325 version. Generally the same as yesterday, with some extensions:
- fix a possible synchronisation issue by using a local int[1] array instead of an int attribute on the object,
- return a SortedVIntList when it is definitely smaller than an OpenBitSet; the method doing this is protected,
- all constructors in OpenBitSetDISI now also take an initial size argument (still called maxSize, perhaps better renamed to initialSize).

Both ChainedFilter and BooleanFilter should work normally, except perhaps using less memory because of the SortedVIntList. ChainedFilter still has the 1.1 ASL; it's probably time to upgrade it, but I did not change it in the patch.

> Things to be done now that Filter is independent from BitSet
>
>                 Key: LUCENE-1187
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1187
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Paul Elschot
>            Priority: Minor
>         Attachments: BooleanFilter20080325.patch, ChainedFilterAndCachingFilterTest.patch, Contrib20080325.patch, Contrib20080326.patch, javadocsZero2Match.patch, OpenBitSetDISI-20080322.patch
>
> (Aside: where is the documentation on how to mark up text in jira comments?)
> The following things are left over after LUCENE-584:
> For Lucene 3.0 Filter.bits() will have to be removed.
> There is a CHECKME in IndexSearcher about using ConjunctionScorer to have the boolean behaviour of a Filter.
> I have not looked into Filter caching yet, but I suppose there will be some room for improvement there. Iirc the current core has moved to use OpenBitSetFilter and that is probably what is being cached. In some cases it might be better to cache a SortedVIntList instead.
> Boolean logic on DocIdSetIterator is already available for Scorers (that inherit from DocIdSetIterator) in the search package. This is currently implemented by ConjunctionScorer, DisjunctionSumScorer, ReqOptSumScorer and ReqExclScorer.
> Boolean logic on BitSets is available in contrib/misc and contrib/queries.
> DisjunctionSumScorer calls score() on its subscorers before the score value is actually needed. This could be a reason to introduce a DisjunctionDocIdSetIterator, perhaps as a superclass of DisjunctionSumScorer.
> To fully implement non-scoring queries a TermDocIdSetIterator will be needed, perhaps as a superclass of TermScorer.
> The javadocs in org.apache.lucene.search using matching vs non-zero score: I'll investigate this soon, and provide a patch when necessary.
> An early version of the patches of LUCENE-584 contained a class Matcher, that differs from the current DocIdSet in that Matcher has an explain() method. It remains to be seen whether such a Matcher could be useful between DocIdSet and Scorer.
> The semantics of scorer.skipTo(scorer.doc()) was discussed briefly. This was also discussed at another issue recently, so perhaps it is worthwhile to open a separate issue for this.
> Skipping on a SortedVIntList is done using linear search; this could be improved by adding multilevel skip list info, much like in the Lucene index for documents containing a term.
> One comment by me of 3 Dec 2008: A few complete (test) classes are deprecated; it might be good to add the target release for removal there.
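Paul's "return a SortedVIntList when it is definitely smaller than an OpenBitSet" choice boils down to a size comparison that can be illustrated with back-of-the-envelope arithmetic. The class and method names below are invented for illustration, and the 5-bytes-per-VInt figure is a deliberate worst-case assumption, not Lucene's exact accounting:

```java
// Hedged sketch: when is a sparse VInt doc-id list definitely smaller
// than a bit set over all docs? Names here are illustrative only.
public class DocIdSetSizeHeuristic {
    // An OpenBitSet needs one bit per document, rounded up to 64-bit words.
    static long bitSetBytes(int maxDoc) {
        return ((maxDoc + 63L) / 64L) * 8L;
    }

    // A SortedVIntList stores ascending doc-id deltas as variable-length
    // ints; assume the worst case of 5 bytes per stored doc id.
    static long vIntListWorstCaseBytes(int cardinality) {
        return 5L * cardinality;
    }

    // True when the sparse list is smaller even in the worst case.
    static boolean preferVIntList(int cardinality, int maxDoc) {
        return vIntListWorstCaseBytes(cardinality) < bitSetBytes(maxDoc);
    }

    public static void main(String[] args) {
        // 1,000 hits out of 10,000,000 docs: the sparse list wins easily.
        System.out.println(preferVIntList(1000, 10000000));    // true
        // 5,000,000 hits out of 10,000,000 docs: the bit set is smaller.
        System.out.println(preferVIntList(5000000, 10000000)); // false
    }
}
```

The crossover sits around one set bit per 40 documents under these assumptions, which is why a filter matching few docs benefits from the sparse representation.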
[jira] Updated: (LUCENE-1187) Things to be done now that Filter is independent from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Elschot updated LUCENE-1187:
---------------------------------
    Component/s: Search
                 contrib/*
    Lucene Fields: [New, Patch Available] (was: [New])

> Things to be done now that Filter is independent from BitSet
[jira] Assigned: (LUCENE-1187) Things to be done now that Filter is independent from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch reassigned LUCENE-1187:
-------------------------------------
    Assignee: Michael Busch

> Things to be done now that Filter is independent from BitSet
[jira] Commented: (LUCENE-1187) Things to be done now that Filter is independent from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582414#action_12582414 ]

Michael Busch commented on LUCENE-1187:
---------------------------------------
Thanks for your patches, Paul. I'll be traveling the next days, but I'll try to look at the patches next week.

> Things to be done now that Filter is independent from BitSet
[jira] Commented: (LUCENE-1231) Column-stride fields (aka per-document Payloads)
[ https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582420#action_12582420 ]

Michael McCandless commented on LUCENE-1231:
--------------------------------------------
{quote}
How would this compare to making the storing of position and freq optional for a field? Then one could have an indexed field with a payload or boost but with no freq (or positions, since freq is required for positions). Would that be equivalent?
{quote}

I think this would be very similar, except maybe:

* This proposal would allow for optional non-sparse, fixed-length storage (ie, don't include the docID since all docs have a payload, and the payload is always the same length). EG norms are like this.
* [From the thread linked above] would allow for binary storage of field values. EG for int fields you would store the 4 bytes per value, and populating the cache would be much faster than the FieldCache now (which must re-parse Strings -> ints, and must walk the terms to "reconstruct" the forward index).
* This proposal may allow for updating these values, like we can do with norms today. Maybe this can only work if the field is non-sparse, and perhaps only if you've loaded it into the FieldCache? This would tie into LUCENE-831, so that you could load these fields entirely in RAM, incrementally update them from a reopen, etc.

> Column-stride fields (aka per-document Payloads)
>
>                 Key: LUCENE-1231
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1231
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.4
>
> This new feature has been proposed and discussed here:
> http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results
> Currently it is possible in Lucene to store data as stored fields or as payloads.
> Stored fields provide good performance if you want to load all fields for one document, because this is a sequential I/O operation.
> If you however want to load the data from one field for a large number of documents, then stored fields perform quite badly, because lots of I/O seeks might have to be performed.
> A better way to do this is using payloads. By creating a "special" posting list that has one posting with payload for each document you can "simulate" a column-stride field. The performance is significantly better compared to stored fields, but still not optimal. The reason is that for each document the freq value, which in this particular case is always 1, has to be decoded, and also one position value, which is always 0, has to be loaded.
> As a solution we want to add real column-stride fields to Lucene. A possible format for the new data structure could look like this (CSD stands for column-stride data; once we decide on a final name for this feature we can change this):
> CSDList --> FixedLengthList |
> FixedLengthList --> ^SegSize
> VariableLengthList -->
> Payload --> Byte^PayloadLength
> PayloadLength --> VInt
> SkipList --> see frq.file
> We distinguish here between the fixed-length and the variable-length cases. To allow flexibility, Lucene could automatically pick the "right" data structure. This could work like this: When the DocumentsWriter writes a segment it checks whether all values of a field have the same length. If yes, it stores them as FixedLengthList; if not, then as VariableLengthList. When the SegmentMerger merges two or more segments it checks if all segments have a FixedLengthList with the same length for a column-stride field. If not, it writes a VariableLengthList to the new segment.
> Once this feature is implemented, we should think about making the column-stride fields updateable, similar to the norms.
This will be a very powerful > feature that can for example be used for low-latency tagging of documents. > Other use cases: > - replace norms > - allow to store boost values separately from norms > - as input for the FieldCache, thus providing significantly improved loading > performance (see LUCENE-831) > Things that need to be done here: > - decide for a name for this feature :) - I think "column-stride fields" was > liked better than "per-document payloads" > - Design an API for this feature. We should keep in mind here that these > fields are supposed to be updateable. > - Define datastructures. > I would like to get this feature into 2.4. Feedback about the open questions > is very welcome so that we can finalize the design soon and start > implementing. -- This message is automatically generated by JIRA. - You can reply to this
Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes
> Michael McCandless resolved LUCENE-510.

Congratulations. :)

When I wrote my initial patch, I saw a performance degradation of c. 30% in my indexing benchmarks. Repeated reallocation was presumably one culprit: when the length in Java chars is stored in the index, you only need to allocate once, whereas when reading in UTF-8, you can't know just how much memory you need until the read completes. Furthermore, at write time, you can't look at something composed of 16-bit chars and know what the byte length of its UTF-8 representation will be without pre-scanning.

How did you solve those problems? Are the string diffs and comparisons now performed against raw bytes, so that fewer conversions are needed?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
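Marvin's pre-scanning point is worth making concrete: the UTF-8 byte length of a sequence of Java chars can be computed without allocating anything, at the cost of one extra pass over the chars. A minimal sketch (the class and method names are illustrative, not from Lucene):

```java
public class Utf8Prescan {
    // Count the UTF-8 bytes a char sequence would encode to, without
    // actually encoding it. A surrogate pair becomes one 4-byte sequence;
    // an unpaired surrogate is counted as 3 bytes here (real encoders
    // vary in how they substitute malformed input).
    static int utf8Length(CharSequence s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                       // ASCII
            } else if (c < 0x800) {
                bytes += 2;                       // e.g. Latin-1 supplement, Cyrillic
            } else if (Character.isHighSurrogate(c) && i + 1 < s.length()
                       && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;                       // supplementary code point
                i++;                              // consume the low surrogate too
            } else {
                bytes += 3;                       // rest of the BMP
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("abc"));          // 3
        System.out.println(utf8Length("\u00e9"));       // 2 (é)
        System.out.println(utf8Length("\u4e2d"));       // 3 (CJK char)
        System.out.println(utf8Length("\uD835\uDD0A")); // 4 (U+1D50A)
    }
}
```

Whether this extra pass beats growing a reused byte[] on demand is an empirical question; Mike's reply below suggests reusable buffers were the approach actually taken.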
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582422#action_12582422 ]

Michael McCandless commented on LUCENE-831:
-------------------------------------------
One question here: should we switch to a method call, instead of a straight array, to retrieve a cached value for a doc? If we did that, then MultiSearchers would forward the request to the right IndexReader.

The benefit then is that reopen() of a reader would not have to allocate & bulk copy massive arrays when updating the caches. It would keep the cost of reopen closer to the size of the new segments. And this way the old reader & the new one would not double-allocate the RAM required to hold the common parts of the cache.

We could always still provide a "give me the full array" fallback if people really wanted that (and were willing to accept the cost).

> Complete overhaul of FieldCache API/Implementation
>
>                 Key: LUCENE-831
>                 URL: https://issues.apache.org/jira/browse/LUCENE-831
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Hoss Man
>            Assignee: Michael Busch
>             Fix For: 2.4
>
>         Attachments: fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff, fieldcache-overhaul.diff
>
> Motivation:
> 1) Completely overhaul the API/implementation of "FieldCache" type things...
>    a) eliminate global static map keyed on IndexReader (thus eliminating the synch block between completely independent IndexReaders)
>    b) allow more customization of cache management (ie: use expiration/replacement strategies, disk backed caches, etc)
>    c) allow people to define custom cache data logic (ie: custom parsers, complex datatypes, etc... anything tied to a reader)
>    d) allow people to inspect what's in a cache (list of CacheKeys) for an IndexReader so a new IndexReader can be likewise warmed.
>    e) Lend support for smarter cache management if/when IndexReader.reopen is added (merging of cached data from subReaders).
> 2) Provide backwards compatibility to support the existing FieldCache API with the new implementation, so there is no redundant caching as client code migrates to the new API.
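The "method call instead of a straight array" idea can be sketched as follows. The interface and class names are invented for illustration, not a proposed Lucene API: a multi-segment accessor forwards each lookup to the sub-accessor owning that doc, so a reopen would only need to rebuild accessors for new segments instead of copying one big array.

```java
public class FieldValueAccessors {
    // A method-call cache accessor instead of a bare int[] (illustrative).
    interface IntAccessor {
        int get(int docId);
    }

    // Per-segment values backed by a plain array, as FieldCache does today.
    static class ArrayIntAccessor implements IntAccessor {
        private final int[] values;
        ArrayIntAccessor(int[] values) { this.values = values; }
        public int get(int docId) { return values[docId]; }
    }

    // Forwards to the sub-accessor owning docId; on reopen, only accessors
    // for new segments would need to be rebuilt, the rest are shared.
    static class MultiIntAccessor implements IntAccessor {
        private final IntAccessor[] subs;  // one accessor per segment
        private final int[] starts;        // doc base of each segment
        MultiIntAccessor(IntAccessor[] subs, int[] starts) {
            this.subs = subs;
            this.starts = starts;
        }
        public int get(int docId) {
            int lo = 0, hi = subs.length - 1;  // binary-search the segment
            while (lo < hi) {
                int mid = (lo + hi + 1) >>> 1;
                if (starts[mid] <= docId) lo = mid; else hi = mid - 1;
            }
            return subs[lo].get(docId - starts[lo]);
        }
    }

    public static void main(String[] args) {
        IntAccessor multi = new MultiIntAccessor(
            new IntAccessor[] {
                new ArrayIntAccessor(new int[] {10, 11, 12}),  // segment 0
                new ArrayIntAccessor(new int[] {20, 21})       // segment 1
            },
            new int[] {0, 3});
        System.out.println(multi.get(2)); // 12 (segment 0, local doc 2)
        System.out.println(multi.get(4)); // 21 (segment 1, local doc 1)
    }
}
```

The trade-off Mike raises is visible here: every get() pays an interface dispatch plus a segment search, versus a single array index in the flat-array design.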
Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes
Marvin Humphrey wrote:
> > Michael McCandless resolved LUCENE-510.
> Congratulations. :)

Thanks. I didn't quite realize what I was getting myself into when I said "yes" on that issue!

> When I wrote my initial patch, I saw a performance degradation of c. 30% in my indexing benchmarks.

I think it was 20%.

> Repeated reallocation was presumably one culprit: when length in Java chars is stored in the index, you only need to allocate once, whereas when reading in UTF-8, you can't know just how much memory you need until the read completes. Furthermore, at write-time, you can't look at something composed of 16-bit chars and know what the byte-length of its UTF-8 representation will be without pre-scanning.

Right, not doing allocations was pretty much it (the getBytes method of String was most of the slowdown, I think). I was also able to eliminate another per-term scan we were doing in DocumentsWriter and fold it into the conversion.

I ended up creating custom conversion methods (UTF8toUTF16 and vice versa) to do this conversion into a re-used byte[] or char[], which grow as needed; then I just bulk-write the bytes. I think this is not much slower than before (modified UTF-8), since that also had to go character by character with ifs inside the inner loop.

I'm less happy with the 11% slowdown on TermEnum, and that's even with the optimization to incrementally decode only the "new" UTF-8 bytes as we are reading the changed suffix of each term, reusing the already-decoded UTF-16 chars from the previous term. This will slow down populating a FieldCache, which is already slow. But LUCENE-831 and LUCENE-1231 should fix that.

> Are the string diffs and comparisons now performed against raw bytes, so that fewer conversions are needed?

Alas, not yet: Lucene still uses UTF-16 Java chars internally. The conversion to UTF-8 happens "at the last minute" when writing, and "immediately" when reading.
I started exploring keeping UTF-8 bytes further in, but it quickly got messy because it would require changing how the term infos are sorted to be unicode code point order. Comparing bytes in UTF-8 is the same as comparing unicode code points, which is nice. But comparing UTF-16 values is almost but not quite the same. So suddenly everywhere where a string comparison takes place I had to assess whether that comparison should be by unicode code point, and call our own method for doing so. It quickly became a "big" project so I ran back to sorting by UTF-16 value. Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
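The "almost but not quite the same" above comes from the surrogate range: in UTF-16 code unit order, a BMP character above U+DFFF compares greater than any surrogate, but in code point (and therefore UTF-8 byte) order it compares less than any supplementary character. A self-contained demonstration of the two orders disagreeing:

```java
public class Utf16VsCodePoint {
    public static void main(String[] args) {
        String bmp  = "\uFB01";       // U+FB01, a BMP char above the surrogate range
        String supp = "\uD800\uDC00"; // U+10000, the smallest supplementary code point

        // UTF-16 code unit order (String.compareTo): 0xFB01 > 0xD800,
        // so the BMP char sorts AFTER the supplementary char.
        System.out.println(bmp.compareTo(supp) > 0);                  // true

        // Code point order (same as comparing UTF-8 bytes): 0xFB01 < 0x10000,
        // so the BMP char sorts BEFORE the supplementary char.
        System.out.println(bmp.codePointAt(0) < supp.codePointAt(0)); // true
    }
}
```

So the two orders agree everywhere except when a string containing U+E000 through U+FFFF is compared against one containing a supplementary character at the same position, which is exactly why switching the term dictionary to byte comparisons would have forced a sort-order change.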
Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes
On Wed, Mar 26, 2008 at 5:22 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > > Are the string diffs and comparisons now performed against raw > > bytes, so that fewer conversions are needed? > > Alas, not yet: Lucene still uses UTF16 java chars internally. The > conversion to UTF-8 happens "at the last minute" when writing, and > "immediately" when reading. > > I started exploring keeping UTF-8 bytes further in, but it quickly > got messy because it would require changing how the term infos are > sorted to be unicode code point order. Comparing bytes in UTF-8 is > the same as comparing unicode code points, which is nice. But > comparing UTF-16 values is almost but not quite the same. So > suddenly everywhere where a string comparison takes place I had to > assess whether that comparison should be by unicode code point, and > call our own method for doing so. It quickly became a "big" project > so I ran back to sorting by UTF-16 value. Hmmm, can't we always do it by unicode code point? When do we need UTF-16 order? -Yonik
[jira] Commented: (LUCENE-1231) Column-stride fields (aka per-document Payloads)
[ https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582442#action_12582442 ] Doug Cutting commented on LUCENE-1231: -- So there are a number of features these fields would have that differ from other fields: - no freq - no positions - non-sparse representation - binary values (is this different from payloads?) - updateable My question is whether it is best to bundle these together as a new kind of field, or add these as optional features of ordinary fields, or some combination. There are a certain bundles that may work well together: e.g., a dense array of fixed-size, updateable binary values w/o freqs or positions. And not all combinations may be sensible or easy to implement. But most of these would also be useful ala carte too, e.g., no-freqs, no-positions and (perhaps) updateable. BTW, setTermPositions(TermPositions) and setTermDocs(TermDocs) might be a reasonable API for updating sparse fields. > Column-stride fields (aka per-document Payloads) > > > Key: LUCENE-1231 > URL: https://issues.apache.org/jira/browse/LUCENE-1231 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.4 > > > This new feature has been proposed and discussed here: > http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results > Currently it is possible in Lucene to store data as stored fields or as > payloads. > Stored fields provide good performance if you want to load all fields for one > document, because this is an sequential I/O operation. > If you however want to load the data from one field for a large number of > documents, then stored fields perform quite badly, because lot's of I/O seeks > might have to be performed. > A better way to do this is using payloads. 
By creating a "special" posting > list > that has one posting with payload for each document you can "simulate" a > column- > stride field. The performance is significantly better compared to stored > fields, > however still not optimal. The reason is that for each document the freq > value, > which is in this particular case always 1, has to be decoded, also one > position > value, which is always 0, has to be loaded. > As a solution we want to add real column-stride fields to Lucene. A possible > format for the new data structure could look like this (CSD stands for column- > stride data, once we decide for a final name for this feature we can change > this): > CSDList --> FixedLengthList | > FixedLengthList --> ^SegSize > VariableLengthList --> > Payload --> Byte^PayloadLength > PayloadLength --> VInt > SkipList --> see frq.file > We distinguish here between the fixed length and the variable length cases. To > allow flexibility, Lucene could automatically pick the "right" data > structure. > This could work like this: When the DocumentsWriter writes a segment it > checks > whether all values of a field have the same length. If yes, it stores them as > FixedLengthList, if not, then as VariableLengthList. When the SegmentMerger > merges two or more segments it checks if all segments have a FixedLengthList > with the same length for a column-stride field. If not, it writes a > VariableLengthList to the new segment. > Once this feature is implemented, we should think about making the column- > stride fields updateable, similar to the norms. This will be a very powerful > feature that can for example be used for low-latency tagging of documents. 
> Other use cases: > - replace norms > - allow to store boost values separately from norms > - as input for the FieldCache, thus providing significantly improved loading > performance (see LUCENE-831) > Things that need to be done here: > - decide for a name for this feature :) - I think "column-stride fields" was > liked better than "per-document payloads" > - Design an API for this feature. We should keep in mind here that these > fields are supposed to be updateable. > - Define datastructures. > I would like to get this feature into 2.4. Feedback about the open questions > is very welcome so that we can finalize the design soon and start > implementing. -- This message is automatically generated by JIRA.
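To make the FixedLengthList idea above concrete, here is a minimal sketch of a dense, fixed-width per-document column. This is not the proposed file format; the class and method names are invented for illustration. The point is that a dense layout gives O(1) value lookup with no per-posting freq or position decoding:

```java
import java.nio.ByteBuffer;

// Sketch of a dense column: segSize * valueLength bytes, ordered by document number.
public class FixedLengthColumn {
    private final byte[] data;
    private final int valueLength;

    FixedLengthColumn(byte[] data, int valueLength) {
        this.data = data;
        this.valueLength = valueLength;
    }

    // O(1) lookup: the value for doc N starts at byte N * valueLength.
    byte[] get(int doc) {
        byte[] out = new byte[valueLength];
        System.arraycopy(data, doc * valueLength, out, 0, valueLength);
        return out;
    }

    public static void main(String[] args) {
        // Three documents, each with one 4-byte value, packed densely.
        byte[] packed = ByteBuffer.allocate(12).putInt(7).putInt(8).putInt(9).array();
        FixedLengthColumn col = new FixedLengthColumn(packed, 4);
        System.out.println(ByteBuffer.wrap(col.get(1)).getInt());  // 8
    }
}
```

The variable-length case would need an offsets structure (the issue's SkipList/PayloadLength) in front of the data, which is why the proposal distinguishes the two layouts.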
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582443#action_12582443 ] Michael Busch commented on LUCENE-831: -- {quote} The benefit then is that reopen() of a reader would not have to allocate & bulk copy massive arrays when updating the caches. It would keep the cost of reopen closer to the size of the new segments. {quote} I agree, Mike. Currently during reopen() the MultiSegmentReader allocates a new norms array of size maxDoc(), which is, as you said, inefficient if only some (maybe even small) segments changed. The method call might be a little slower than the array lookup, but I doubt that this would be very significant. We can make this change for the norms and run performance tests to measure the slowdown. > Complete overhaul of FieldCache API/Implementation > -- > > Key: LUCENE-831 > URL: https://issues.apache.org/jira/browse/LUCENE-831 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Hoss Man >Assignee: Michael Busch > Fix For: 2.4 > > Attachments: fieldcache-overhaul.032208.diff, > fieldcache-overhaul.diff, fieldcache-overhaul.diff > > > Motivation: > 1) Completely overhaul the API/implementation of "FieldCache" type things... > a) eliminate global static map keyed on IndexReader (thus > eliminating synch block between completely independent IndexReaders) > b) allow more customization of cache management (ie: use > expiration/replacement strategies, disk backed caches, etc) > c) allow people to define custom cache data logic (ie: custom > parsers, complex datatypes, etc... anything tied to a reader) > d) allow people to inspect what's in a cache (list of CacheKeys) for > an IndexReader so a new IndexReader can be likewise warmed. > e) Lend support for smarter cache management if/when > IndexReader.reopen is added (merging of cached data from subReaders). 
> 2) Provide backwards compatibility to support existing FieldCache API with > the new implementation, so there is no redundant caching as client code > migrates to the new API. -- This message is automatically generated by JIRA.
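The quoted point above (a method call instead of a raw array, so reopen() need not bulk-copy a maxDoc()-sized array) can be sketched as follows. The interface and method names are invented for illustration and are not Lucene's actual API:

```java
import java.util.Arrays;

// Illustrative accessor interface: one virtual call per lookup instead of a raw array.
public class PerSegmentValues {
    interface DocValues {
        int intVal(int doc);
    }

    // Values for a single segment, backed by a plain array.
    static DocValues forSegment(int[] values) {
        return doc -> values[doc];
    }

    // Multi-segment view: a binary search plus a delegated call per lookup, but
    // reopening with one changed segment only swaps that segment's entry -- no
    // maxDoc()-sized array is reallocated or copied.
    static DocValues forSegments(DocValues[] subs, int[] starts) {
        return doc -> {
            int i = Arrays.binarySearch(starts, doc);
            if (i < 0) i = -i - 2;  // doc falls inside the segment starting before it
            return subs[i].intVal(doc - starts[i]);
        };
    }

    public static void main(String[] args) {
        DocValues seg1 = forSegment(new int[]{10, 11});
        DocValues seg2 = forSegment(new int[]{20});
        DocValues multi = forSegments(new DocValues[]{seg1, seg2}, new int[]{0, 2});
        System.out.println(multi.intVal(2));  // 20: doc 2 is doc 0 of the second segment
    }
}
```

The cost Michael mentions is the virtual call (plus, here, a binary search) per access, which is exactly what the proposed norms performance test would measure.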
Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes
Yonik Seeley <[EMAIL PROTECTED]> wrote: > Hmmm, can't we always do it by unicode code point? > When do we need UTF-16 order? In theory, we can. I think the sort order doesn't matter much, as long as everyone (writers & readers) agree what it is. I think unicode code point order is more "standards compliant" too. A big benefit is then we could leave things (eg TermBuffer and maybe eventually Term, FieldCache) as UTF8 bytes and save on the conversion cost when reading. But I don't think Java provides a way to do this comparison? However it's not hard to implement your own: http://www.icu-project.org/docs/papers/utf16_code_point_order.html But then I worried about how much slower that code is than String.compareTo, and I found a lot of places where an innocent compareTo or < or > needed to be changed to this method call. Field name comparisons would have to be fixed too. Then for backwards compatibility all of these places that do comparisons would have to fall back to the Java way when interacting with an older segment. I think we can still explore this? It just seemed way too big to glom into the already-big changes in LUCENE-510. Mike
Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes
On Wed, Mar 26, 2008 at 6:06 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > Yonik Seeley <[EMAIL PROTECTED]> wrote: > > > Hmmm, can't we always do it by unicode code point? > > When do we need UTF-16 order? > > In theory, we can. I think the sort order doesn't matter much, as > long as everyone (writers & readers) agree what it is. I think > unicode code point order is more "standards compliant" too. > > A big benefit is then we could leave things (eg TermBuffer and maybe > eventually Term, FieldCache) as UTF8 bytes and save on the conversion > cost when reading. > > But I don't think Java provides a way to do this comparison? However > it's not hard to implement your own: > > http://www.icu-project.org/docs/papers/utf16_code_point_order.html Not sure I follow... you just do a byte-by-byte comparison right? For ASCII, this should be slightly faster (same number of comparisons, less memory space and hence less cache space overall). > But then I worried about how much slower that code is than > String.compareTo, and, I found alot of places where innocent compareTo > or < or > needed to be changed to this method call. Field name > comparisons would have to be fixed too. Then for backwards > compatibility all of these places that do comparisons would have to > fallback to the Java way when interacting with an older segment. Oh... older segments. Yeah, I was speaking "theoretically". > I think we can still explore this? It just seemed way too big to > glomm into the already-big changes in LUCENE-510. Yeah, I was thinking of some of this more along the lines of Lucene 3. A term could contain a byte array instead of a String. A String constructor would convert to UTF8 and then do lookups in the index (simple byte comparisons, no charset encoding). A byte constructor for Term would also be allowed. Things like TermEnumerators would keep everything in bytes, the tii would be in bytes, etc. One could also think about ways to directly index bytes too. Is it all worth it? 
I really don't know. -Yonik
Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes
Yonik Seeley wrote: On Wed, Mar 26, 2008 at 6:06 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: Yonik Seeley <[EMAIL PROTECTED]> wrote: Hmmm, can't we always do it by unicode code point? When do we need UTF-16 order? In theory, we can. I think the sort order doesn't matter much, as long as everyone (writers & readers) agree what it is. I think unicode code point order is more "standards compliant" too. A big benefit is then we could leave things (eg TermBuffer and maybe eventually Term, FieldCache) as UTF8 bytes and save on the conversion cost when reading. But I don't think Java provides a way to do this comparison? However it's not hard to implement your own: http://www.icu-project.org/docs/papers/utf16_code_point_order.html Not sure I follow... you just do a byte-by-byte comparison right? For ASCII, this should be slightly faster (same number of comparisons, less memory space and hence less cache space overall). Sorry, you're right: if you're working with byte[] at the time, a byte by byte comparison of UTF8 gives you the same order as unicode code point. It's when you need to compare a String or char[] to one another, or to a UTF8 byte[], that you need that code. But then I worried about how much slower that code is than String.compareTo, and, I found alot of places where innocent compareTo or < or > needed to be changed to this method call. Field name comparisons would have to be fixed too. Then for backwards compatibility all of these places that do comparisons would have to fallback to the Java way when interacting with an older segment. Oh... older segments. Yeah, I was speaking "theoretically". Yeah. I think we can still explore this? It just seemed way too big to glomm into the already-big changes in LUCENE-510. Yeah, I was thinking of some of this more along the lines of Lucene 3. A term could contain a byte array instead of a String. 
A String constructor would convert to UTF8 and then do lookups in the index (simple byte comparisons, no charset encoding). A byte constructor for Term would also be allowed. Things like TermEnumerators would keep everything in bytes, the tii would be in bytes, etc. Yup. One could also think about ways to directly index bytes too. Right, DocumentsWriter could hold its terms in byte[] and save time/space when terms are ASCII. Is it all worth it? I really don't know. Right, that's where I started to wonder. It felt very much like I was "going against the grain of Java" as the changes started to pile up ... Mike
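The comparison subtlety resolved in this thread is easy to demonstrate: an unsigned byte-by-byte comparison of UTF-8 matches Unicode code point order, while String.compareTo (UTF-16 code unit order) disagrees once supplementary characters are involved. A self-contained sketch, not Lucene code:

```java
import java.nio.charset.StandardCharsets;

public class CodePointOrder {
    // Unsigned byte-by-byte comparison of the UTF-8 encodings; this is
    // equivalent to comparing by Unicode code point.
    static int compareUtf8(String a, String b) {
        byte[] ba = a.getBytes(StandardCharsets.UTF_8);
        byte[] bb = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(ba.length, bb.length);
        for (int i = 0; i < n; i++) {
            int d = (ba[i] & 0xFF) - (bb[i] & 0xFF);
            if (d != 0) return d;
        }
        return ba.length - bb.length;
    }

    public static void main(String[] args) {
        String bmp = "\uFF01";                                 // U+FF01, in the BMP
        String supp = new String(Character.toChars(0x10000));  // first supplementary char
        // UTF-16 order: the supplementary char sorts first, because its lead
        // surrogate (0xD800) is below 0xFF01 as a 16-bit code unit.
        System.out.println(bmp.compareTo(supp) > 0);    // true
        // Code point order: U+FF01 < U+10000, so the BMP char sorts first.
        System.out.println(compareUtf8(bmp, supp) < 0); // true
    }
}
```

This is also why, as Yonik notes, pure byte[] comparisons are cheap; the ICU-style comparison is only needed when a String or char[] must be ordered consistently with UTF-8 bytes.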
[jira] Commented: (LUCENE-1231) Column-stride fields (aka per-document Payloads)
[ https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582464#action_12582464 ] Michael McCandless commented on LUCENE-1231: Sorry, you're right: the payload is the binary data. {quote} So there are a number of features these fields would have that differ from other fields: {quote} Maybe add "stored in its own file" or some such to that list. Ie, to efficiently update field X I would think you want it stored in its own file. We would then fully write a new generation of that file whenever it had changes. I agree it would be great to implement this as "flexible indexing", such that these are simply à la carte options on how the field is indexed, rather than making a new specialized kind of field that just does one of these "combinations". But I haven't wrapped my brain around what all this will entail... it's a biggie! {quote} BTW, setTermPositions(TermPositions) and setTermDocs(TermDocs) might be a reasonable API for updating sparse fields. {quote} I like that! > Column-stride fields (aka per-document Payloads) > > > Key: LUCENE-1231 > URL: https://issues.apache.org/jira/browse/LUCENE-1231 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.4 > > -- This message is automatically generated by JIRA.
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582471#action_12582471 ] Mark Miller commented on LUCENE-831: >If you're going to incrementally update a FieldCache of a MultiReader, it's the same issue... can't merge the ordinals without the original (String) values. That is a great point. >should we switch to a method call, instead of a straight array, to retrieve a cached value for a doc? Sounds like a great idea to me. It solves the StringIndex merge and eliminates all merge costs, at the price of a method call per access. > Complete overhaul of FieldCache API/Implementation > -- > > Key: LUCENE-831 > URL: https://issues.apache.org/jira/browse/LUCENE-831 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Hoss Man >Assignee: Michael Busch > Fix For: 2.4 > > Attachments: fieldcache-overhaul.032208.diff, > fieldcache-overhaul.diff, fieldcache-overhaul.diff > > -- This message is automatically generated by JIRA.
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582480#action_12582480 ] Mark Miller commented on LUCENE-831: Hmm... how do we avoid having to pull the cached field values through a sync on every call? The field data has to be cached... and the method to return the single cached field value has to be multi-threaded... > Complete overhaul of FieldCache API/Implementation > -- > > Key: LUCENE-831 > URL: https://issues.apache.org/jira/browse/LUCENE-831 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Hoss Man >Assignee: Michael Busch > Fix For: 2.4 > > Attachments: fieldcache-overhaul.032208.diff, > fieldcache-overhaul.diff, fieldcache-overhaul.diff > > -- This message is automatically generated by JIRA.
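One common answer to the per-call sync worry above is to pay the synchronization only when the per-field accessor is first created, and have the accessor wrap a safely published, immutable array so that each value read is a plain, lock-free method call. A hedged sketch with invented names and a stand-in loader (not the proposed FieldCache API):

```java
import java.util.concurrent.ConcurrentHashMap;

public class LockFreeReads {
    interface IntValues {
        int get(int doc);
    }

    private final ConcurrentHashMap<String, IntValues> cache = new ConcurrentHashMap<>();

    // computeIfAbsent does the synchronized work at most once per field; after
    // that, each lookup returns the already-built accessor without contention.
    IntValues getInts(String field) {
        return cache.computeIfAbsent(field, f -> {
            final int[] values = loadFieldValues(f);  // expensive load, done once
            return doc -> values[doc];                // plain array read, no lock
        });
    }

    // Stand-in for uninverting a field from the index.
    int[] loadFieldValues(String field) {
        return new int[]{1, 2, 3};
    }

    public static void main(String[] args) {
        LockFreeReads reader = new LockFreeReads();
        System.out.println(reader.getInts("price").get(2));  // 3
    }
}
```

The final array reference inside the accessor gives safe publication; an update would install a whole new accessor rather than mutate the array in place.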
[jira] Updated: (LUCENE-1245) MultiFieldQueryParser is not friendly for overriding getFieldQuery(String,String,int)
[ https://issues.apache.org/jira/browse/LUCENE-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Trejkaz updated LUCENE-1245: Lucene Fields: [New, Patch Available] (was: [New]) Summary: MultiFieldQueryParser is not friendly for overriding getFieldQuery(String,String,int) (was: MultiFieldQueryParser is not friendly for overriding) (Updating title to be more specific about what wasn't friendly.) > MultiFieldQueryParser is not friendly for overriding > getFieldQuery(String,String,int) > - > > Key: LUCENE-1245 > URL: https://issues.apache.org/jira/browse/LUCENE-1245 > Project: Lucene - Java > Issue Type: Improvement > Components: QueryParser >Affects Versions: 2.3.2 >Reporter: Trejkaz > > LUCENE-1213 fixed an issue in MultiFieldQueryParser where the slop parameter > wasn't being properly applied. The problem is that the fix which eventually got > committed calls super.getFieldQuery(String,String), bypassing any > possibility of customising the query behaviour. > This should be relatively simple to fix by modifying > getFieldQuery(String,String,int) to, if field is null, recursively call > getFieldQuery(String,String,int) instead of setting the slop itself. This > gives subclasses which override either getFieldQuery method a chance to do > something different. -- This message is automatically generated by JIRA.
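The delegation pattern the issue proposes can be modeled without any Lucene dependency; all class names below are illustrative stand-ins, and "queries" are plain strings. The key point is that the multi-field (null field) path recurses through the overridable three-argument getFieldQuery instead of jumping straight to super, so a subclass override of either method still takes effect:

```java
import java.util.ArrayList;
import java.util.List;

public class OverrideFriendly {
    static class Parser {
        protected String getFieldQuery(String field, String text) {
            return field + ":" + text;
        }
        protected String getFieldQuery(String field, String text, int slop) {
            return getFieldQuery(field, text) + "~" + slop;  // apply slop to base query
        }
    }

    static class MultiFieldParser extends Parser {
        private final String[] fields;
        MultiFieldParser(String[] fields) { this.fields = fields; }

        @Override
        protected String getFieldQuery(String field, String text, int slop) {
            if (field == null) {
                // Fan out per field via the overridable 3-arg method, not super's 2-arg one.
                List<String> clauses = new ArrayList<>();
                for (String f : fields) clauses.add(getFieldQuery(f, text, slop));
                return String.join(" OR ", clauses);
            }
            return super.getFieldQuery(field, text, slop);
        }
    }

    public static void main(String[] args) {
        // A subclass override is now visible even on the multi-field path.
        Parser p = new MultiFieldParser(new String[]{"a", "b"}) {
            @Override
            protected String getFieldQuery(String field, String text) {
                return super.getFieldQuery(field, "z" + text);  // customize per-field text
            }
        };
        System.out.println(p.getFieldQuery(null, "abc", 1));  // a:zabc~1 OR b:zabc~1
    }
}
```

This mirrors the shape of the test case that appears later in the thread, where mangling the query text must yield the customized terms for every field.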
[jira] Updated: (LUCENE-1245) MultiFieldQueryParser is not friendly for overriding getFieldQuery(String,String,int)
[ https://issues.apache.org/jira/browse/LUCENE-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Trejkaz updated LUCENE-1245: Attachment: multifield.patch The fix makes getFieldQuery(String,String) and getFieldQuery(String,String,int) work more or less the same. Neither calls methods on super, so overriding the methods will work (and does, although I have no unit test for this yet). Common boosting logic is extracted to an applyBoost method. I have also removed the check for the clauses being empty, as getBooleanQuery appears to do that already. > MultiFieldQueryParser is not friendly for overriding > getFieldQuery(String,String,int) > - > > Key: LUCENE-1245 > URL: https://issues.apache.org/jira/browse/LUCENE-1245 > Project: Lucene - Java > Issue Type: Improvement > Components: QueryParser >Affects Versions: 2.3.2 >Reporter: Trejkaz > Attachments: multifield.patch > > -- This message is automatically generated by JIRA.
[jira] Commented: (LUCENE-1245) MultiFieldQueryParser is not friendly for overriding getFieldQuery(String,String,int)
[ https://issues.apache.org/jira/browse/LUCENE-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582490#action_12582490 ] Trejkaz commented on LUCENE-1245: - Here's an example illustrating the way we were using it, although instead of changing the query text we're actually returning a different query class -- that class isn't in Lucene Core, and it's also easier to build up an expected query if it's just a TermQuery.

public void testOverrideGetFieldQuery() throws Exception {
    String[] fields = { "a", "b" };
    QueryParser parser = new MultiFieldQueryParser(fields, new StandardAnalyzer()) {
        protected Query getFieldQuery(String field, String queryText, int slop) throws ParseException {
            if (field != null && slop == 1) {
                field = "z" + field;
            }
            return super.getFieldQuery(field, queryText, slop);
        }
    };
    BooleanQuery expected = new BooleanQuery();
    expected.add(new TermQuery(new Term("a", "zabc")), BooleanClause.Occur.SHOULD);
    expected.add(new TermQuery(new Term("b", "zabc")), BooleanClause.Occur.SHOULD);
    assertEquals("Expected a mangled query", expected, parser.parse("\"abc\"~1"));
}

> MultiFieldQueryParser is not friendly for overriding > getFieldQuery(String,String,int) > - > > Key: LUCENE-1245 > URL: https://issues.apache.org/jira/browse/LUCENE-1245 > Project: Lucene - Java > Issue Type: Improvement > Components: QueryParser >Affects Versions: 2.3.2 >Reporter: Trejkaz > Attachments: multifield.patch > > -- This message is automatically generated by JIRA.
[jira] Issue Comment Edited: (LUCENE-1245) MultiFieldQueryParser is not friendly for overriding getFieldQuery(String,String,int)
[ https://issues.apache.org/jira/browse/LUCENE-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582490#action_12582490 ] trejkaz edited comment on LUCENE-1245 at 3/26/08 5:13 PM: -- Here's an example illustrating the way we were using it, although instead of changing the query text we're actually returning a different query class -- that class isn't in Lucene Core, and it's also easier to build up an expected query if it's just a TermQuery.
{noformat}
public void testOverrideGetFieldQuery() throws Exception {
    String[] fields = { "a", "b" };
    QueryParser parser = new MultiFieldQueryParser(fields, new StandardAnalyzer()) {
        protected Query getFieldQuery(String field, String queryText, int slop) throws ParseException {
            if (field != null && slop == 1) {
                field = "z" + field;
            }
            return super.getFieldQuery(field, queryText, slop);
        }
    };
    BooleanQuery expected = new BooleanQuery();
    expected.add(new TermQuery(new Term("a", "zabc")), BooleanClause.Occur.SHOULD);
    expected.add(new TermQuery(new Term("b", "zabc")), BooleanClause.Occur.SHOULD);
    assertEquals("Expected a mangled query", expected, parser.parse("\"abc\"~1"));
}
{noformat}
> MultiFieldQueryParser is not friendly for overriding > getFieldQuery(String,String,int) > - > > Key: LUCENE-1245 > URL: https://issues.apache.org/jira/browse/LUCENE-1245 > Project: Lucene - Java > Issue Type: Improvement > Components: QueryParser >Affects Versions: 2.3.2 >Reporter: Trejkaz > Attachments: multifield.patch > > -- This message is automatically generated by JIRA.
[jira] Issue Comment Edited: (LUCENE-1245) MultiFieldQueryParser is not friendly for overriding getFieldQuery(String,String,int)
[ https://issues.apache.org/jira/browse/LUCENE-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582490#action_12582490 ] trejkaz edited comment on LUCENE-1245 at 3/26/08 5:32 PM: -- Here's an example illustrating the way we were using it, although instead of changing the query text we're actually returning a different query class -- that class isn't in Lucene Core, and it's also easier to build up an expected query if it's just a TermQuery.
{noformat}
public void testOverrideGetFieldQuery() throws Exception {
    String[] fields = { "a", "b" };
    QueryParser parser = new MultiFieldQueryParser(fields, new StandardAnalyzer()) {
        protected Query getFieldQuery(String field, String queryText, int slop) throws ParseException {
            if (field != null && slop == 1) {
                queryText = "z" + queryText;
            }
            return super.getFieldQuery(field, queryText, slop);
        }
    };
    BooleanQuery expected = new BooleanQuery();
    expected.add(new TermQuery(new Term("a", "zabc")), BooleanClause.Occur.SHOULD);
    expected.add(new TermQuery(new Term("b", "zabc")), BooleanClause.Occur.SHOULD);
    assertEquals("Expected a mangled query", expected, parser.parse("\"abc\"~1"));
}
{noformat}
was (Author: trejkaz): Here's an example illustrating the way we were using it, although instead of changing the query text we're actually returning a different query class -- that class isn't in Lucene Core, and it's also easier to build up an expected query if it's just a TermQuery.
{noformat}
public void testOverrideGetFieldQuery() throws Exception {
    String[] fields = { "a", "b" };
    QueryParser parser = new MultiFieldQueryParser(fields, new StandardAnalyzer()) {
        protected Query getFieldQuery(String field, String queryText, int slop) throws ParseException {
            if (field != null && slop == 1) {
                field = "z" + field;
            }
            return super.getFieldQuery(field, queryText, slop);
        }
    };
    BooleanQuery expected = new BooleanQuery();
    expected.add(new TermQuery(new Term("a", "zabc")), BooleanClause.Occur.SHOULD);
    expected.add(new TermQuery(new Term("b", "zabc")), BooleanClause.Occur.SHOULD);
    assertEquals("Expected a mangled query", expected, parser.parse("\"abc\"~1"));
}
{noformat}
> MultiFieldQueryParser is not friendly for overriding > getFieldQuery(String,String,int) > - > > Key: LUCENE-1245 > URL: https://issues.apache.org/jira/browse/LUCENE-1245 > Project: Lucene - Java > Issue Type: Improvement > Components: QueryParser >Affects Versions: 2.3.2 >Reporter: Trejkaz > Attachments: multifield.patch > > -- This message is automatically generated by JIRA.
Hudson build is back to normal: Lucene-trunk #417
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/417/changes