[jira] Commented: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order
[ https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781837#action_12781837 ] Michael McCandless commented on LUCENE-2086:

{quote}
I didn't want to have commits in 3.0, because if I respin a release, I would not be able to take only some of the fixes into 3.0.0. That was the reason. Can you also put this in 2.9.2 if you remove the generics?
{quote}

OK, I'll backport...

> When resolving deletes, IW should resolve in term sort order
>
> Key: LUCENE-2086
> URL: https://issues.apache.org/jira/browse/LUCENE-2086
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 2.9.2, 3.0.1, 3.1
>
> Attachments: LUCENE-2086.patch
>
> See the java-dev thread "IndexWriter.updateDocument performance improvement".
Socket and file locks
Hi,

> > > shouldn't active code like that live in the application layer?
> > Why?
> You can all but guarantee that polling will work at the app layer

The application layer may also run with low priority. In operating systems, it's usually the lower layers that have more 'rights' (priority), not the higher levels (I'm not saying it should be like that in Java). I just think the application layer should not have to deal with write locks or with removing write locks.

> by the time the original process realizes that it doesn't hold the lock
> anymore, the damage could already have been done.

Yes, I'm not sure how to best avoid that (with any design). Asking the application layer or the user whether the lock file can be removed is probably more dangerous than trying our best in Lucene.

Standby / hibernate: the question is, if the process is currently not running, does it still hold the lock? I think no, because the machine might as well be turned off. How do you detect whether the machine is turned off versus in hibernate mode? I guess that's a problem for all mechanisms (socket / file lock / background thread). When a hibernated process wakes up again, it thinks it owns the lock. Even if the process checks before each write, it is unsafe:

if (isStillLocked()) { write(); }

The process could be suspended after isStillLocked() but before write(). One protection is: the second process (the one that breaks the lock) would need to work on a copy of the data instead of the original file (it could delete / truncate the original file after creating a copy). On Windows, renaming the file might work (not sure); on Linux you probably need to copy the content to a new file. That way, the awoken process can only destroy inactive data. The question is: do we need to solve this problem? How big is the risk? Instead of solving this problem completely, you could detect it after the fact without much overhead, and throw an exception saying: "data may be corrupt now".

PID: with the PID, you could check if the process still runs. But it could be another process with the same PID (is that possible?), or the same PID on a different machine (when using a network share). It's probably safer if you can communicate with the lock owner (using TCP/IP, or over the file system by deleting/creating a file).

Unique id: the easiest solution is to use a UUID (a cryptographically secure random number). That problem _is_ solved (some systems have trouble generating entropy, but there are workarounds). If you have a communication channel to the process anyway, you could ask it for this UUID. Once you have a communication channel, you can do a lot (reference counting, safely transferring the lock, ...).

> If the server and the client can't access each other

How do you find out that the server is still running? My point is: I'd like to have a secure, automatic way to break the lock if the machine or process is stopped. And from my experience, native file locking is problematic for this. You could also combine solutions (such as: combine the 'open a server socket' solution with the 'background thread' solution). I'm not sure if it's worth it to solve the 'hibernate' problem.

Regards,
Thomas
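A minimal Java sketch of the two ideas above -- a UUID-stamped lock file plus the unavoidable check-then-write window. This is illustrative only (UuidFileLock, isStillLocked and guardedWrite are invented names, and it uses the modern java.nio.file API for brevity); it is not Lucene's LockFactory code:

{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.UUID;

// Hypothetical UUID-stamped lock file. The owner writes a random UUID into
// the lock file; anyone can later verify (or break) ownership by comparing
// UUIDs. Note the unavoidable race: the owning process can be suspended
// (e.g. hibernated) between isStillLocked() and write().
class UuidFileLock {
  private final Path lockFile;
  private final String myId = UUID.randomUUID().toString();

  UuidFileLock(Path lockFile) throws IOException {
    this.lockFile = lockFile;
    // CREATE_NEW fails if the lock file already exists (someone holds it).
    Files.write(lockFile, myId.getBytes(StandardCharsets.UTF_8),
        StandardOpenOption.CREATE_NEW);
  }

  boolean isStillLocked() throws IOException {
    // True only while the lock file still carries our UUID.
    return myId.equals(new String(Files.readAllBytes(lockFile),
        StandardCharsets.UTF_8));
  }

  void guardedWrite(Runnable write) throws IOException {
    if (isStillLocked()) {
      // UNSAFE: a second process may break the lock right here, after the
      // check passed -- exactly the window described above. Having the
      // lock-breaker work on a copy of the data limits the damage.
      write.run();
    }
  }
}
{code}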
Re: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)
As DM Smith said, since the bug is longstanding and we are only now just hearing about it, it appears not to be that severe in practice. I guess users don't often mix coord enabled & disabled BQs, that are otherwise identical, in the same cache.

So I think we ship 3.0.0 anyways?

Mike

On Tue, Nov 24, 2009 at 2:26 AM, Uwe Schindler wrote:
> Hi all,
>
> Hoss reported a bug about two fields missing in the equals/hashCode of
> BooleanQuery (which has existed since 1.9,
> https://issues.apache.org/jira/browse/LUCENE-2092). Should I respin 3.0
> because of this or just release it? Speak out loud if you want to respin
> (else vote)!
>
> We will apply the bugfix at least to 2.9.2 and 3.0.1
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>> -----Original Message-----
>> From: Uwe Schindler [mailto:u...@thetaphi.de]
>> Sent: Sunday, November 22, 2009 4:07 PM
>> To: gene...@lucene.apache.org; java-dev@lucene.apache.org
>> Subject: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)
>>
>> Hi,
>>
>> I have built the artifacts for the final release of "Apache Lucene Java
>> 3.0.0" a second time, because of a bug in the TokenStream API (found by
>> Shai Erera, who wanted to do "bad" things with addAttribute, breaking its
>> behaviour, LUCENE-2088) and an improvement in NumericRangeQuery (to
>> prevent stack overflow, LUCENE-2087). They are targeted for release on
>> 2009-11-25.
>>
>> The artifacts are here:
>> http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-take2/
>>
>> You find the changes in the corresponding sub folder. The SVN revision is
>> 883080; here is the manifest with build system info:
>>
>> Manifest-Version: 1.0
>> Ant-Version: Apache Ant 1.7.0
>> Created-By: 1.5.0_22-b03 (Sun Microsystems Inc.)
>> Specification-Title: Lucene Search Engine
>> Specification-Version: 3.0.0
>> Specification-Vendor: The Apache Software Foundation
>> Implementation-Title: org.apache.lucene
>> Implementation-Version: 3.0.0 883080 - 2009-11-22 15:52:49
>> Implementation-Vendor: The Apache Software Foundation
>> X-Compile-Source-JDK: 1.5
>> X-Compile-Target-JDK: 1.5
>>
>> Please vote to officially release these artifacts as "Apache Lucene Java
>> 3.0.0".
>>
>> We need at least 3 binding (PMC) votes.
>>
>> Thanks everyone for all their hard work on this, and I am very sorry for
>> requesting a vote again, but that's life! Thanks Shai for the pointer to
>> the bug!
>>
>> Here is the proposed release note, please edit if needed:
>> ------------------------------------------------------------
>>
>> Hello Lucene users,
>>
>> On behalf of the Lucene dev community (a growing community far larger
>> than just the committers) I would like to announce the release of Lucene
>> Java 3.0:
>>
>> The new version is mostly a cleanup release without any new features.
>> All deprecations targeted to be removed in version 3.0 were removed. If
>> you are upgrading from version 2.9.1 of Lucene, you have to fix all
>> deprecation warnings in your code base to be able to recompile against
>> this version.
>>
>> This is the first Lucene release with Java 5 as a minimum requirement.
>> The API was cleaned up to make use of Java 5's generics, varargs, enums,
>> and autoboxing. New users of Lucene are advised to use this version for
>> new developments, because it has a clean, type-safe new API. Upgrading
>> users can now remove unnecessary casts and add generics to their code,
>> too. If you have not upgraded your installation to Java 5, please read
>> the file JRE_VERSION_MIGRATION.txt (please note that this is not related
>> to Lucene 3.0; it will also happen with any previous release when you
>> upgrade your Java environment).
>>
>> Lucene 3.0 has some changes regarding compressed fields: 2.9 already
>> deprecated compressed fields, and support for them has now been removed.
>> Lucene 3.0 is still able to read indexes with compressed fields, but as
>> soon as merges occur or the index is optimized, all compressed fields
>> are decompressed and converted to Field.Store.YES. Because of this,
>> indexes with compressed fields can suddenly get larger.
>>
>> While we generally try to maintain full backwards compatibility between
>> major versions, Lucene 3.0 has some minor breaks, mostly related to
>> deprecation removal, pointed out in the 'Changes in backwards
>> compatibility policy' section of CHANGES.txt. Notable are:
>>
>> - IndexReader.open(Directory) now opens in read-only mode per default
>> (this method was deprecated because of that in 2.9). The same occurs to
>> IndexSearcher.
>>
>> - Already started in 2.9, core TokenStreams are now made final to
>> enforce the decorator pattern.
>>
>> - If you interrupt an IndexWriter merge thread, IndexWriter now throws
>> an unchecked ThreadInterruptedException that extends RuntimeException
>> and clears the interrupt status.
RE: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)
> As DM Smith said, since the bug is longstanding and we are only now
> just hearing about it, it appears not to be that severe in practice.
> I guess users don't often mix coord enabled & disabled BQs, that are
> otherwise identical, in the same cache.

DM Smith also wanted this in 2.9.2, which I think is fine. The fix is so simple, we could simply merge it to the 2.9 branch. And Erick Erickson also noted that this bug is longstanding.

> So I think we ship 3.0.0 anyways?

+1, I just wanted to ask. Now votes are required; I have zero counting ones so far.

Uwe
Re: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)
On Tue, Nov 24, 2009 at 11:09 AM, Uwe Schindler wrote:
>> As DM Smith said, since the bug is longstanding and we are only now
>> just hearing about it, it appears not to be that severe in practice.
>> I guess users don't often mix coord enabled & disabled BQs, that are
>> otherwise identical, in the same cache.
>
> DM Smith also wanted this in 2.9.2, which I think is fine. The fix is so
> simple, we could simply merge it to the 2.9 branch. And Erick Erickson
> also noted that this bug is longstanding.
>
>> So I think we ship 3.0.0 anyways?
>
> +1, I just wanted to ask. Now votes are required; I have zero counting
> ones so far.

+1 for not respinning 3.0 with this bug. I would also agree with the statements above!
+1 for 3.0 even not being a PMC member :)

simon
Re: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)
+1 to release the current artifacts as 3.0.0!

Mike

On Tue, Nov 24, 2009 at 5:11 AM, Simon Willnauer wrote:
> +1 for not respinning 3.0 with this bug. I would also agree with the
> statements above!
> +1 for 3.0 even not being a PMC member :)
>
> simon
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads
[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781855#action_12781855 ] Uwe Schindler commented on LUCENE-2075:

Just one question: the cache is initialized with max 1024 entries. Why that number? If we share the cache between multiple threads, maybe we should raise the max size. Or make it configurable? The entries in the cache are not very costly, so why not use 8192 or 16384? MTQs would be happy with that.

> Share the Term -> TermInfo cache across threads
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage. You're
> also cutting way back on the likelihood of a cache hit (except the known
> multiple times we look up a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often,
> which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap. One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary). You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary. Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?
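To make the double-barrel idea concrete, here is a minimal hand-written sketch (invented names, not the patch attached to this issue): lookups check the primary map first, promote secondary hits, and swap the barrels once primary fills up. Note the peak footprint is up to 2x maxSize entries, since both barrels can be full just before a swap.

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a double-barrel LRU: two ConcurrentHashMaps (primary/secondary).
// Approximate LRU eviction falls out of the swap, with no per-access
// locking or recency bookkeeping.
class DoubleBarrelLRUCache<K, V> {
  private final int maxSize;
  private final AtomicInteger countdown;
  private volatile Map<K, V> primary = new ConcurrentHashMap<K, V>();
  private volatile Map<K, V> secondary = new ConcurrentHashMap<K, V>();

  DoubleBarrelLRUCache(int maxSize) {
    this.maxSize = maxSize;
    this.countdown = new AtomicInteger(maxSize);
  }

  public V get(K key) {
    V v = primary.get(key);
    if (v == null) {
      v = secondary.get(key);
      if (v != null) {
        put(key, v); // promote recently used entry back into primary
      }
    }
    return v;
  }

  public void put(K key, V value) {
    primary.put(key, value);
    if (countdown.decrementAndGet() == 0) {
      // primary is full: clear the old secondary and swap the barrels,
      // so recent entries survive and stale ones are dropped wholesale.
      // (The swap is not atomic; a racing reader may see a brief miss,
      // which costs a cache miss, never correctness.)
      Map<K, V> old = secondary;
      old.clear();
      secondary = primary;
      primary = old;
      countdown.set(maxSize);
    }
  }
}
{code}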
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781859#action_12781859 ] Michael McCandless commented on LUCENE-1458:

{quote}
in trunk, things sort in UTF-16 binary order. in branch, things sort in UTF-8 binary order. these are different...
{quote}

Ugh! In the back of my mind I almost remembered this... I think this was one reason why I didn't do this back in LUCENE-843 (I think we had discussed this already, then... though maybe I'm suffering from déjà vu). I could swear at one point I had that fixup logic implemented in a UTF-8/16 comparison method...

UTF-8 sort order (what the flex branch has switched to) is true unicode codepoint sort order, while UTF-16 sort order is not, when there are surrogate pairs as well as high (>= U+E000) unicode chars. Sigh.

So this is definitely a back compat problem. And, unfortunately, even if we like the true codepoint sort order, it's not easy to switch to in a back-compat manner, because if we write new segments into an old index, SegmentMerger will be in big trouble when it tries to merge two segments that sorted their terms differently.

I would also prefer true codepoint sort order... but we can't break back compat. Though it would be nice to let the codec control the sort order -- eg then (I think?) the ICU/CollationKeyFilter workaround wouldn't be needed.

Fortunately the problem is isolated to how we sort the buffered postings when it's time to flush a new segment, so I think with the appropriate fixup logic (eg your comment at https://issues.apache.org/jira/browse/LUCENE-1606?focusedCommentId=12781746&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12781746), applied when comparing terms in oal.index.TermsHashPerField.comparePostings during that sort, we can get back to UTF-16 sort order. (A sketch of such fixup logic follows the issue description below.)

> Further steps towards flexible indexing
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, UnicodeTestCase.patch, UnicodeTestCase.patch
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback. All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package-private APIs on that branch, then fix the nightly build to use
> the tip of that branch?]
>
> There's still plenty to do before this is committable! This is a
> rather large change:
>
> * Switches to a new, more efficient terms dict format. This still
>   uses tii/tis files, but the tii only stores term & long offset
>   (not a TermInfo). At seek points, tis encodes term & freq/prox
>   offsets absolutely instead of with deltas. Also, tis/tii are
>   structured by field, so we don't have to record the field number
>   in every term.
>   .
>   On the first 1 M docs of Wikipedia, the tii file is 36% smaller
>   (0.99 MB -> 0.64 MB) and the tis file is 9% smaller (75.5 MB ->
>   68.5 MB).
>   .
>   RAM usage when loading the terms dict index is significantly less,
>   since we only load an array of offsets and an array of String (no
>   more TermInfo array). It should be faster to init too.
>   .
>   This part is basically done.
>
> * Introduces a modular reader codec that strongly decouples the terms
>   dict from the docs/positions readers. EG there is no more TermInfo
>   used when reading the new format.
>   .
>   There's nice symmetry now between reading & writing in the codec
>   chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
>   This part is basically done.
>
> * Introduces a new "flex" API for iterating through the fields,
>   terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
>   This replaces TermEnum/Docs/Positions. SegmentReader emulates the
>   old API on top of the new API to keep back-compat.
>
> Next steps:
>
> * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>   fix any hidden assumptions.
>
> * Expose the new API out of IndexReader, deprecate the old API but
>   emulate it on top of the new one, switch all core/contrib users to
>   the new API.
>
> * Maybe switch to AttributeSources as the base class for TermsEnum,
>   DocsEnum, PostingsEnum -- this would give readers API flexibility
>   (not just index-file-format flexibility). EG if someone wanted to
>   store payloads at the term-doc level instead of the
>   term-doc-position level, you could just add a new attribute.
>
> * Test performance & iterate.
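For reference, the fixup logic mentioned in the comment above can look like the following sketch, which compares term text given as Java char[] (UTF-16 code units) in true codepoint order -- the order their UTF-8/UTF-32 encodings would sort in. This is the standard surrogate remapping trick, written from scratch here rather than taken from the patch:

{code}
// Sketch (not the patch): plain char comparison yields UTF-16 binary order,
// which diverges from codepoint order once surrogate pairs (supplementary
// chars > U+FFFF) meet BMP chars >= U+E000.
static int compareCodePointOrder(char[] a, int aLen, char[] b, int bLen) {
  int end = Math.min(aLen, bLen);
  for (int i = 0; i < end; i++) {
    char ca = a[i], cb = b[i];
    if (ca != cb) {
      // Remap so surrogates (U+D800-U+DFFF), which begin codepoints above
      // U+FFFF, sort after every other BMP code unit: U+E000-U+FFFF move
      // down by 0x800, surrogates move up by 0x2000.
      if (ca >= 0xd800 && cb >= 0xd800) {
        if (ca >= 0xe000) ca -= 0x800; else ca += 0x2000;
        if (cb >= 0xe000) cb -= 0x800; else cb += 0x2000;
      }
      return ca - cb;
    }
  }
  return aLen - bLen;
}
{code}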
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads
[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781863#action_12781863 ] Michael McCandless commented on LUCENE-2075:

Well, I just kept 1024 since that's what we currently do ;)

OK, I just did a rough tally -- I think we're looking at ~100 bytes (on a 32-bit JRE) per entry, including CHM's HashEntry, the array in CHM, TermInfoAndOrd, Term & its String text. Not to mention DBLRU has a 2X multiplier at peak, so 200 bytes. So at 1024 we're looking at ~200KB peak used by this cache, per segment that is able to saturate it... so for a 20-segment index you're at ~4MB additional RAM consumed... so I don't think we should increase this default. (The arithmetic is restated in the sketch below.)

Also, I don't think this cache is/should be attempting to achieve a high hit rate *across* queries, only *within* a single query, when that query resolves the same Term more than once. I think caches that wrap more CPU, like Solr's query cache, are where the app should aim for a high hit rate.

Maybe we should even decrease the default size here -- what's important is preventing in-flight queries from evicting one another's cache entries. For NRQ, 1024 is apparently already plenty big for that (relatively few seeks occur). For automaton query, which does lots of seeking, once the flex branch lands there is no need for the cache (each lookup is done only once, because the TermsEnum actualEnum is able to seek). Before flex lands, the cache is important, but only for automaton query I think.

And honestly I'm still tempted to do away with this cache altogether and create a "query scope", private to each query while it's running, where the terms dict (and other places that need to, over time) could store stuff. That'd give a perfect within-query hit rate and wouldn't tie up any long-term RAM...
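A back-of-the-envelope restatement of that tally (the figures are the estimates from the comment above, not measurements):

{code}
// Rough peak RAM of the shared cache, per Mike's estimated figures.
public class CacheRamEstimate {
  public static void main(String[] args) {
    long bytesPerEntry = 100;    // CHM HashEntry + TermInfoAndOrd + Term/String, 32-bit JRE
    long peakEntries = 2 * 1024; // double-barrel LRU peaks at 2x its configured size
    long perSegment = bytesPerEntry * peakEntries;          // 204,800 bytes ~= 200 KB
    double totalMB = 20 * perSegment / (1024.0 * 1024.0);   // 20 saturated segments
    System.out.println(perSegment / 1024 + " KB/segment, ~" + totalMB + " MB total"); // ~3.9 MB
  }
}
{code}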
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2034:

Attachment: LUCENE-2034,patch

Updated the patch to the current trunk. I have not removed all the deprecated methods in contrib/analyzers yet - we should open another issue for that, IMO. Yet this patch still breaks back compatibility, as some of the non-final contrib analyzers extend StopawareAnalyzer, which makes the old tokenStream / reusableTokenStream methods final. IMO this should not block this issue, for the following reasons:

1. it's in contrib - different story for core
2. it is super easy to port them
3. it makes the API cleaner and has less code
4. those analyzers might have to change anyway due to the deprecated methods

(A sketch of the base-class pattern follows the issue description below.)

simon

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.9
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2034,patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt
>
> Due to the various tokenStream APIs we had in Lucene, analyzer subclasses
> need to implement at least one of the methods returning a TokenStream.
> When you look at the code it appears to be almost identical if both are
> implemented in the same analyzer. Each analyzer defines the same inner
> class (SavedStreams), which is unnecessary.
> In contrib almost every analyzer uses stopwords, and each of them creates
> its own way of loading them, or defines a large number of ctors to load
> stopwords from a file, set, arrays etc.; those ctors should be deprecated
> and eventually removed.
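For illustration, the base-class pattern being discussed could look roughly like this sketch against the 3.x Analyzer API. StopwordAwareAnalyzerBase and createStream are invented names (not the classes in the patch), and real stream reuse also needs to reset the saved Tokenizer on the new Reader:

{code}
import java.io.IOException;
import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

// Sketch: stopword storage and the tokenStream/reusableTokenStream
// boilerplate live in one base class; concrete analyzers only implement
// the createStream() hook. Making the two public methods final enforces
// the single code path (the back-compat break Simon mentions above).
public abstract class StopwordAwareAnalyzerBase extends Analyzer {
  protected final Set<?> stopwords;

  protected StopwordAwareAnalyzerBase(Set<?> stopwords) {
    this.stopwords = stopwords;
  }

  /** The only method subclasses implement. */
  protected abstract TokenStream createStream(String fieldName, Reader reader);

  @Override
  public final TokenStream tokenStream(String fieldName, Reader reader) {
    return createStream(fieldName, reader);
  }

  @Override
  public final TokenStream reusableTokenStream(String fieldName, Reader reader)
      throws IOException {
    TokenStream saved = (TokenStream) getPreviousTokenStream();
    if (saved == null) {
      saved = createStream(fieldName, reader);
      setPreviousTokenStream(saved);
    }
    // NOTE: real reuse must also reset the saved Tokenizer on the new
    // Reader; the actual patch handles that with a SavedStreams-style holder.
    return saved;
  }
}
{code}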
[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2034:

Attachment: LUCENE-2034,patch

Set the svn eol-style property to native - missed that in the last patch.
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads
[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781869#action_12781869 ] Uwe Schindler commented on LUCENE-2075:

{quote}
And honestly I'm still tempted to do away with this cache altogether and create a "query scope", private to each query while it's running, where the terms dict (and other places that need to, over time) could store stuff. That'd give a perfect within-query hit rate and wouldn't tie up any long-term RAM...
{quote}

By "query scope" do you mean a whole query, not only an MTQ? If you combine multiple AutomatonQueries in a BooleanQuery, they could also profit from the cache (as they do currently).

I think until flex, we should commit this and use the cache. When flex is out, we may think of doing this differently.
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781870#action_12781870 ] Uwe Schindler commented on LUCENE-2034:

bq. Set the svn eol-style property to native - missed that in the last patch

You can configure your SVN client to do it automatically and also add the $Id$ props.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781874#action_12781874 ] Robert Muir commented on LUCENE-1458:

{quote}
Though it would be nice to let the codec control the sort order - eg then (I think?) the ICU/CollationKeyFilter workaround wouldn't be needed.
{quote}

I like this idea by the way, "flexible sorting". although i like codepoint order better than code unit order, i hate binary order in general, to be honest. its nice we have 'indexable'/fast collation right now, but its maybe not what users expect either (binary keys encoded into text).
[jira] Assigned: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned LUCENE-2034:

Assignee: Robert Muir
RE: svn commit: r883654 - /lucene/java/trunk/CHANGES.txt
Do we need a new 3.0? (duck) - but it's fixed, only at the wrong position in changes.

But we should also fix the 3.0 branch for 3.0.1

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: mikemcc...@apache.org [mailto:mikemcc...@apache.org]
> Sent: Tuesday, November 24, 2009 12:20 PM
> To: java-comm...@lucene.apache.org
> Subject: svn commit: r883654 - /lucene/java/trunk/CHANGES.txt
>
> Author: mikemccand
> Date: Tue Nov 24 11:19:43 2009
> New Revision: 883654
>
> URL: http://svn.apache.org/viewvc?rev=883654&view=rev
> Log:
> LUCENE-2045: fix CHANGES entry (this was fixed in 2.9.2/3.0, not 2.9.1)
>
> Modified:
>     lucene/java/trunk/CHANGES.txt
>
> Modified: lucene/java/trunk/CHANGES.txt
> URL: http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?rev=883654&r1=883653&r2=883654&view=diff
> ==============================================================
> --- lucene/java/trunk/CHANGES.txt (original)
> +++ lucene/java/trunk/CHANGES.txt Tue Nov 24 11:19:43 2009
> @@ -188,6 +188,10 @@
>  * LUCENE-2088: addAttribute() should only accept interfaces that
>    extend Attribute. (Shai Erera, Uwe Schindler)
>
> +* LUCENE-2045: Fix silly FileNotFoundException hit if you enable
> +  infoStream on IndexWriter and then add an empty document and commit
> +  (Shai Erera via Mike McCandless)
> +
>  New features
>
>  * LUCENE-1933: Provide a convenience AttributeFactory that creates a
> @@ -258,10 +262,6 @@
>     char (U+FFFD) during indexing, to prevent silent index corruption.
>     (Peter Keegan, Mike McCandless)
>
> - * LUCENE-2045: Fix silly FileNotFoundException hit if you enable
> -   infoStream on IndexWriter and then add an empty document and commit
> -   (Shai Erera via Mike McCandless)
> -
>  * LUCENE-2046: IndexReader should not see the index as changed, after
>    IndexWriter.prepareCommit has been called but before
>    IndexWriter.commit is called. (Peter Keegan via Mike McCandless)
RE: svn commit: r883654 - /lucene/java/trunk/CHANGES.txt
I see we have the same problem with the next CHANGES entry, LUCENE-2046 :(

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
Re: svn commit: r883654 - /lucene/java/trunk/CHANGES.txt
OK looks like you fixed LUCENE-2046 as well, and ported both fixes to 3.0.x CHANGES.

I don't think this merits a 3.0.0 respin.

Though I wonder if there are other issues that got incorrectly moved into 2.9.1?

Mike
RE: svn commit: r883654 - /lucene/java/trunk/CHANGES.txt
I looked through the 2.9.1 changes and found none that was too new. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Michael McCandless [mailto:luc...@mikemccandless.com] > Sent: Tuesday, November 24, 2009 12:39 PM > To: java-dev@lucene.apache.org > Subject: Re: svn commit: r883654 - /lucene/java/trunk/CHANGES.txt > > OK looks like you fixed LUCENE-2046 as well, and ported both fixes to > 3.0.x CHANGES. > > I don't think this merits a 3.0.0 respin. > > Though I wonder if there are other issues that got incorrectly moved into > 2.9.1? > > Mike > > On Tue, Nov 24, 2009 at 6:26 AM, Uwe Schindler wrote: > > Do we need a new 3.0? (duck) - but it's fixed only at wrong position in > > changes. > > > > But we should also fix the 3.0 branch for 3.0.1 > > > > - > > Uwe Schindler > > H.-H.-Meier-Allee 63, D-28213 Bremen > > http://www.thetaphi.de > > eMail: u...@thetaphi.de > > > > > >> -Original Message- > >> From: mikemcc...@apache.org [mailto:mikemcc...@apache.org] > >> Sent: Tuesday, November 24, 2009 12:20 PM > >> To: java-comm...@lucene.apache.org > >> Subject: svn commit: r883654 - /lucene/java/trunk/CHANGES.txt > >> > >> Author: mikemccand > >> Date: Tue Nov 24 11:19:43 2009 > >> New Revision: 883654 > >> > >> URL: http://svn.apache.org/viewvc?rev=883654&view=rev > >> Log: > >> LUCENE-2045: fix CHANGES entry (this was fixed in 2.9.2/3.0, not 2.9.1) > >> > >> Modified: > >> lucene/java/trunk/CHANGES.txt > >> > >> Modified: lucene/java/trunk/CHANGES.txt > >> URL: > >> > http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?rev=883654&r1=8 > >> 83653&r2=883654&view=diff > >> > == > >> > >> --- lucene/java/trunk/CHANGES.txt (original) > >> +++ lucene/java/trunk/CHANGES.txt Tue Nov 24 11:19:43 2009 > >> @@ -188,6 +188,10 @@ > >> * LUCENE-2088: addAttribute() should only accept interfaces that > >> extend Attribute. (Shai Erera, Uwe Schindler) > >> > >> +* LUCENE-2045: Fix silly FileNotFoundException hit if you enable > >> + infoStream on IndexWriter and then add an empty document and commit > >> + (Shai Erera via Mike McCandless) > >> + > >> New features > >> > >> * LUCENE-1933: Provide a convenience AttributeFactory that creates a > >> @@ -258,10 +262,6 @@ > >> char (U+FFFD) during indexing, to prevent silent index corruption. > >> (Peter Keegan, Mike McCandless) > >> > >> - * LUCENE-2045: Fix silly FileNotFoundException hit if you enable > >> - infoStream on IndexWriter and then add an empty document and commit > >> - (Shai Erera via Mike McCandless) > >> - > >> * LUCENE-2046: IndexReader should not see the index as changed, after > >> IndexWriter.prepareCommit has been called but before > >> IndexWriter.commit is called. (Peter Keegan via Mike McCandless) > >> > > > > > > > > - > > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > > For additional commands, e-mail: java-dev-h...@lucene.apache.org > > > > > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads
[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781876#action_12781876 ] Michael McCandless commented on LUCENE-2075: bq. With Query Scope you mean a whole query, so not only a MTQ? If you combine multiple AutomatonQueries in a BooleanQuery it could also profit from the cache (as it is currently). Right, I think the top level query would open up the scope... and free it once it's done running. bq. I think until Flex, we should commit this and use the cache. When Flex is out, we may think of doing this different. OK let's go with the shared cache for now, and revisit once flex lands. I'll open a new issue... But should we drop the cache to maybe 512? Tying up 4 MB RAM (with cache size 1024) for a "normal" index is kind of a lot... > Share the Term -> TermInfo cache across threads > --- > > Key: LUCENE-2075 > URL: https://issues.apache.org/jira/browse/LUCENE-2075 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 3.1 > > Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, > LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, > LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, > LUCENE-2075.patch > > > Right now each thread creates its own (thread private) SimpleLRUCache, > holding up to 1024 terms. > This is rather wasteful, since if there are a high number of threads > that come through Lucene, you're multiplying the RAM usage. You're > also cutting way back on likelihood of a cache hit (except the known > multiple times we lookup a term within-query, which uses one thread). > In NRT search we open new SegmentReaders (on tiny segments) often > which each thread must then spend CPU/RAM creating & populating. > Now that we are on 1.5 we can use java.util.concurrent.*, eg > ConcurrentHashMap. One simple approach could be a double-barrel LRU > cache, using 2 maps (primary, secondary). You check the cache by > first checking primary; if that's a miss, you check secondary and if > you get a hit you promote it to primary. Once primary is full you > clear secondary and swap them. > Or... any other suggested approach? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
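[Editor's illustration] To make the double-barrel idea in the issue description concrete, here is a minimal, self-contained sketch of such a cache. The class and method names are illustrative, not the attached patch's actual code, and the swap is deliberately simplified (a reader racing the swap may briefly miss an entry, which for a cache is harmless):

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a double-barrel LRU: two maps; hits found in secondary are
// promoted to primary; once primary fills up, secondary is cleared and
// the two barrels swap.
public class DoubleBarrelLRUCache<K,V> {
  private final int maxSize;
  private volatile Map<K,V> primary = new ConcurrentHashMap<K,V>();
  private volatile Map<K,V> secondary = new ConcurrentHashMap<K,V>();

  public DoubleBarrelLRUCache(int maxSize) {
    this.maxSize = maxSize;
  }

  public V get(K key) {
    V v = primary.get(key);
    if (v == null) {
      v = secondary.get(key);
      if (v != null) {
        primary.put(key, v); // promote: recently used entries survive the next swap
      }
    }
    return v;
  }

  public synchronized void put(K key, V value) {
    primary.put(key, value);
    if (primary.size() > maxSize) {
      Map<K,V> tmp = secondary;
      tmp.clear();          // drop the oldest barrel
      secondary = primary;  // current entries get one more chance via promotion
      primary = tmp;
    }
  }
}
{code}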
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads
[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781878#action_12781878 ] Michael McCandless commented on LUCENE-2075: OK I opened LUCENE-2093 to track the "query private scope" idea. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unify the analyzer ctors
[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781877#action_12781877 ] Robert Muir commented on LUCENE-2034: - Simon, in my opinion it is OK to make tokenStream/reusableTokenStream final for those non-final contrib analyzers. I think you should make those non-final analyzers final, too. Then we can get rid of the complexity for sure. > Massive Code Duplication in Contrib Analyzers - unify the analyzer ctors > - > > Key: LUCENE-2034 > URL: https://issues.apache.org/jira/browse/LUCENE-2034 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Affects Versions: 2.9 >Reporter: Simon Willnauer >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, > LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt > > > Due to the various tokenStream APIs we had in Lucene, analyzer subclasses > need to implement at least one of the methods returning a TokenStream. When > you look at the code it appears to be almost identical if both are > implemented in the same analyzer. Each analyzer defines the same inner class > (SavedStreams), which is unnecessary. > In contrib almost every analyzer uses stopwords, and each of them creates its > own way of loading them or defines a large number of ctors to load stopwords > from a file, set, arrays etc.; those ctors should be deprecated and > eventually removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
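[Editor's illustration] For readers who have not seen the pattern the issue describes, this is roughly the boilerplate every contrib analyzer ended up repeating. It is a hedged sketch against the 2.9-era Analyzer API; the concrete Whitespace/LowerCase chain is made up for illustration and is not any particular contrib analyzer:

{code}
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public final class ExampleAnalyzer extends Analyzer {

  // the inner class every analyzer re-declared
  private static final class SavedStreams {
    Tokenizer source;
    TokenStream result;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // chain built from scratch on every call...
    return new LowerCaseFilter(new WhitespaceTokenizer(reader));
  }

  @Override
  public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    // ...and the same chain duplicated a second time for the reusable path
    SavedStreams streams = (SavedStreams) getPreviousTokenStream();
    if (streams == null) {
      streams = new SavedStreams();
      streams.source = new WhitespaceTokenizer(reader);
      streams.result = new LowerCaseFilter(streams.source);
      setPreviousTokenStream(streams);
    } else {
      streams.source.reset(reader);
    }
    return streams.result;
  }
}
{code}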
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads
[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781880#action_12781880 ] Uwe Schindler commented on LUCENE-2075: --- I would keep it as it is: we already minimized the memory requirements, since before this change the cache was per-thread. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2093) Use query-private scope instead of shared Term->TermInfo cache
Use query-private scope instead of shared Term->TermInfo cache -- Key: LUCENE-2093 URL: https://issues.apache.org/jira/browse/LUCENE-2093 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Priority: Minor Fix For: 3.1 Spinoff of LUCENE-2075. We currently use a shared terms cache so multiple resolves of the same term within execution of a single query save CPU. But this ties up a good amount of long-term RAM... So, it might be better to instead create a "query private scope", where places in Lucene like the terms dict could store & retrieve results. The scope would be private to each running query, and would be GCable as soon as the query completes. Then we'd have a perfect within-query hit rate... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
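[Editor's illustration] A minimal sketch of what such a scope could look like: just a plain map created when a query starts and dropped when it finishes. QueryScope is a hypothetical name, not an API proposed in the issue; the point is only that per-query ownership removes the need for eviction and synchronization:

{code}
import java.util.HashMap;
import java.util.Map;

// Per-query scratch space: exactly one running query owns it, so no eviction
// policy or locking is needed, and it becomes garbage as soon as the query
// completes.
public class QueryScope {
  private final Map<Object,Object> values = new HashMap<Object,Object>();

  public Object get(Object key) {
    return values.get(key);
  }

  public void put(Object key, Object value) {
    values.put(key, value);
  }
}
{code}

A terms-dict lookup would then check the query's scope before going to disk, giving a perfect hit rate for the repeated within-query lookups without tying up long-term RAM between queries.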
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads
[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781882#action_12781882 ] Robert Muir commented on LUCENE-2075: - I am still trying to figure out the use case. bq. With Query Scope you mean a whole query, so not only a MTQ? If you combine multiple AutomatonQueries in a BooleanQuery it could also profit from the cache (as it is currently). Isn't there a method I can use to force these to combine into one AutomatonQuery (I can use union, intersection, etc)? I haven't done this, but we shouldn't create a private scoped cache for something like this? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unify the analyzer ctors
[ https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781884#action_12781884 ] Simon Willnauer commented on LUCENE-2034: - bq. I think you should make those non-final analyzers final, too. +1 I think the analyzers should always be final. Maybe there are special cases, but for most of them nobody should subclass. It's the same amount of work to make your own anyway. Simon -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads
[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781885#action_12781885 ] Uwe Schindler commented on LUCENE-2075: --- ...not only AutomatonQueries can be combined, they can also be combined with other queries and then make use of the cache. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads
[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781886#action_12781886 ] Robert Muir commented on LUCENE-2075: - Uwe, I just wonder if the cache would get used much in practice. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads
[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781887#action_12781887 ] Uwe Schindler commented on LUCENE-2075: --- For testing we could add two AtomicIntegers to the cache that count hits and requests, to get a hit rate; only temporarily, so as not to affect performance. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
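[Editor's illustration] A sketch of that instrumentation, assuming we are free to name the counters and wrap them in a small helper; this is hypothetical test code, not part of any attached patch:

{code}
import java.util.concurrent.atomic.AtomicInteger;

// Temporary, thread-safe hit-rate instrumentation for the shared cache:
// record() is called on every lookup, hitRate() is read at the end of a test.
public class CacheStats {
  private final AtomicInteger requests = new AtomicInteger();
  private final AtomicInteger hits = new AtomicInteger();

  public void record(boolean wasHit) {
    requests.incrementAndGet();
    if (wasHit) {
      hits.incrementAndGet();
    }
  }

  public double hitRate() {
    int r = requests.get();
    return r == 0 ? 0.0 : ((double) hits.get()) / r;
  }
}
{code}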
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1606: Attachment: LUCENE-1606.patch updated patch: * don't seek to high surrogates; instead tack on \uDC00. This still works for trunk, but also with the flex branch. * don't use a high surrogate prefix; instead truncate. This isn't being used at all; I would rather use 'constant suffix'. * add tests that will break if Lucene's sort order is not UTF-16 (or if automaton is not adjusted to the new sort order) * add another enum constructor, where you can specify smart or dumb mode yourself * regexp javadoc note * add wordage to LICENSE, not just NOTICE > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: automaton.patch, automatonMultiQuery.patch, > automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, > automatonWithWildCard.patch, automatonWithWildCard2.patch, > BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606_nodep.patch > > > Attached is a patch for an AutomatonQuery/Filter (name can change if it's not > suitable). > Whereas the out-of-box contrib RegexQuery is nice, I have some very large > indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. > Additionally all of the existing RegexQuery implementations in Lucene are > really slow if there is no constant prefix. This implementation does not > depend upon a constant prefix, and runs the same query in 640ms. > Some use cases I envision: > 1. lexicography/etc on large text corpora > 2. looking for things such as urls where the prefix is not constant (http:// > or ftp://) > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert > regular expressions into a DFA. Then, the filter "enumerates" terms in a > special way, by using the underlying state machine. Here is my short > description from the comments: > The algorithm here is pretty basic. Enumerate terms but instead of a > binary accept/reject do: > > 1. Look at the portion that is OK (did not enter a reject state in the > DFA) > 2. Generate the next possible String and seek to that. > the Query simply wraps the filter with ConstantScoreQuery. > I did not include the automaton.jar inside the patch but it can be downloaded > from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
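[Editor's illustration] The two-step loop in the issue description is easier to follow in code. In this hedged sketch, seekTo, nextTerm, matches, collect and nextValidString are hypothetical stand-ins (the real patch drives a brics DFA against Lucene's term dictionary); only the control flow is meant to be faithful:

{code}
// Term enumeration driven by a DFA, instead of a binary accept/reject per term.
// matches(term) runs the DFA over the whole term; nextValidString(term) walks
// the DFA along the longest accepted prefix of term and builds the smallest
// string greater than term that the DFA could still accept.
String term = seekTo(""); // hypothetical: position the term enum at the start
while (term != null) {
  if (matches(term)) {
    collect(term);          // accepted: keep it and just scan forward
    term = nextTerm();
  } else {
    String next = nextValidString(term);
    if (next == null) {
      break;                // the DFA can never accept anything >= term: done
    }
    term = seekTo(next);    // skip the whole non-matching region in one seek
  }
}
{code}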
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1606: Attachment: (was: LUCENE-1606.patch) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1606: Attachment: LUCENE-1606.patch Sorry, my IDE added an @author tag. I need to look to see where to turn this @author generation off in Eclipse. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781899#action_12781899 ] Michael McCandless commented on LUCENE-1458: bq. i hate binary order in general to be honest. But binary order in this case is code point order. > Further steps towards flexible indexing > --- > > Key: LUCENE-1458 > URL: https://issues.apache.org/jira/browse/LUCENE-1458 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Affects Versions: 2.9 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Attachments: LUCENE-1458-back-compat.patch, > LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, > LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, > LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, > LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, > LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, > LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, > UnicodeTestCase.patch, UnicodeTestCase.patch > > > I attached a very rough checkpoint of my current patch, to get early > feedback. All tests pass, though back compat tests don't pass due to > changes to package-private APIs plus certain bugs in tests that > happened to work (eg call TermPositions.nextPosition() too many times, > which the new API asserts against). > [Aside: I think, when we commit changes to package-private APIs such > that back-compat tests don't pass, we could go back, make a branch on > the back-compat tag, commit changes to the tests to use the new > package private APIs on that branch, then fix nightly build to use the > tip of that branch?] > There's still plenty to do before this is committable! This is a > rather large change: > * Switches to a new more efficient terms dict format. This still > uses tii/tis files, but the tii only stores term & long offset > (not a TermInfo). At seek points, tis encodes term & freq/prox > offsets absolutely instead of with deltas. Also, tis/tii > are structured by field, so we don't have to record field number > in every term. > . > On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB > -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB). > . > RAM usage when loading terms dict index is significantly less > since we only load an array of offsets and an array of String (no > more TermInfo array). It should be faster to init too. > . > This part is basically done. > * Introduces modular reader codec that strongly decouples terms dict > from docs/positions readers. EG there is no more TermInfo used > when reading the new format. > . > There's nice symmetry now between reading & writing in the codec > chain -- the current docs/prox format is captured in: > {code} > FormatPostingsTermsDictWriter/Reader > FormatPostingsDocsWriter/Reader (.frq file) and > FormatPostingsPositionsWriter/Reader (.prx file). > {code} > This part is basically done. > * Introduces a new "flex" API for iterating through the fields, > terms, docs and positions: > {code} > FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum > {code} > This replaces TermEnum/Docs/Positions. SegmentReader emulates the > old API on top of the new API to keep back-compat.
> > Next steps: > * Plug in new codecs (pulsing, pfor) to exercise the modularity / > fix any hidden assumptions. > * Expose new API out of IndexReader, deprecate old API but emulate > old API on top of new one, switch all core/contrib users to the > new API. > * Maybe switch to AttributeSources as the base class for TermsEnum, > DocsEnum, PostingsEnum -- this would give readers API flexibility > (not just index-file-format flexibility). EG if someone wanted > to store payload at the term-doc level instead of > term-doc-position level, you could just add a new attribute. > * Test performance & iterate. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
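[Editor's illustration] The distinction between UTF-16 order and code point order matters here because Java's String.compareTo compares UTF-16 code units, which disagrees with code point (and hence UTF-8 byte) order once supplementary characters are involved. A small self-contained demonstration; the two example characters are arbitrary choices:

{code}
// U+FF41 (a BMP character near the top of the BMP) vs. U+10400 (a
// supplementary character, stored in UTF-16 as the surrogate pair
// \uD801\uDC00).
public class SortOrderDemo {
  public static void main(String[] args) {
    String bmp = "\uFF41";        // code point 0xFF41
    String supp = "\uD801\uDC00"; // code point 0x10400

    // UTF-16 (code unit) order: the high surrogate 0xD801 < 0xFF41,
    // so the supplementary character sorts FIRST here.
    System.out.println(bmp.compareTo(supp) > 0);                  // true

    // Code point order (== UTF-8 binary order): 0xFF41 < 0x10400,
    // so the supplementary character sorts LAST.
    System.out.println(bmp.codePointAt(0) < supp.codePointAt(0)); // true
  }
}
{code}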
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781904#action_12781904 ] Robert Muir commented on LUCENE-1458: - Mike, I guess I mean I'd prefer UCA order, which isn't just the order codepoints happened to randomly appear on charts, but is actually designed for sorting and ordering things :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads
[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781908#action_12781908 ] Michael McCandless commented on LUCENE-2075: bq. I would keep it as it is: we already minimized the memory requirements, since before this change the cache was per-thread. OK let's leave it at 1024, but with flex (which automaton query no longer needs the cache for), I think we should drop it and/or cut over to query-private scope. I don't think sucking up 4 MB of RAM for this rather limited purpose is warranted. I'll add a comment on LUCENE-2093. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2093) Use query-private scope instead of shared Term->TermInfo cache
[ https://issues.apache.org/jira/browse/LUCENE-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781910#action_12781910 ] Michael McCandless commented on LUCENE-2093: If we don't do this in 3.1, we should at least drop the size of the terms dict cache -- by rough math, that cache will consume 4 MB on a 20 segment index, even for a smallish index. When flex lands, the cache is no longer beneficial for automaton query so it need not be so large. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781911#action_12781911 ] Uwe Schindler commented on LUCENE-1606: --- what is UTF-38? :-) I think you mean UTF-32, if such exists. Else it looks good! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads
[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781913#action_12781913 ] Michael McCandless commented on LUCENE-2075: bq. Uwe, I just wonder if the cache would get used much in practice. This cache (mapping Term -> TermInfo) does get used a lot: for "normal" atomic queries we first hit the terms dict to get the docFreq (to compute idf), then later hit it again with the exact same term, to get the TermDocs enum. So, for these queries our hit rate is 50%, but it's rather overkill to be using a shared cache for this (query-private scope is much cleaner). EG a large automaton query running concurrently with other queries could evict entries before they read the term the 2nd time. Existing MTQs (except NRQ), which seek once and then scan to completion, don't hit the cache (though I think they do double-load each term, which is wasteful; likely this is part of the perf gains for flex). NRQ doesn't do enough seeking, relative to iterating/collecting the docs, for the cache to make much of a difference. The upcoming automaton query should benefit; however, in testing we saw only the full-linear-scan benefit, which I'm still needing to get to the bottom of. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781914#action_12781914 ] Robert Muir commented on LUCENE-1606: - I think there is one last problem with this for the flex branch, where you have abacadaba\uFFFC, abacadaba\uFFFD and abacadaba\uFFFE in the term dictionary, but a regex that matches, say, abacadaba[\uFFFC\uFFFE]. In this case, the match on abacadaba\uFFFD will fail; it will try to seek to the "next" string, which is abacadaba\uFFFE, but the FFFE will get replaced by FFFD by the byte conversion, and we will loop. Mike, I don't think this should be any back-compat concern, unlike the high surrogate case, which I think many CJK applications are probably doing too... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
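[Editor's illustration] The substitution behavior behind this loop can be reproduced with the JDK's own UTF-8 decoder. This is only an analogue: Lucene's flex-branch byte conversion is its own code, but it likewise substitutes U+FFFD for input it cannot represent, which is what lets a seek target silently change:

{code}
// These three bytes encode the surrogate U+D800 "CESU-8 style"; real UTF-8
// forbids surrogate code points, so the decoder substitutes U+FFFD.
public class ReplacementDemo {
  public static void main(String[] args) throws Exception {
    byte[] invalid = { (byte) 0xED, (byte) 0xA0, (byte) 0x80 };
    String decoded = new String(invalid, "UTF-8");
    // The round-tripped term is not the term we asked for -- if a seek
    // target is altered this way, the enumeration can seek backwards
    // relative to its intended position and loop forever.
    System.out.println(decoded.contains("\uFFFD")); // true
  }
}
{code}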
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781914#action_12781914 ] Robert Muir edited comment on LUCENE-1606 at 11/24/09 1:30 PM: --- i think there is one last problem with this for flex branch, where you have abacadaba\uFFFC, abacadaba\uFFFD and abacadaba\uFFFE in the term dictionary, but a regex the matches say abacadaba[\uFFFC\u]. in this case, the match on abacadaba\uFFFD will fail, it will try to seek to the "next" string, which is abacadaba\u, but the will get replaced by FFFD by the byte conversion, and we will loop. mike i don't think this should be any back compat concern, unlike the high surrogate case which i think many CJK applications are probably doing to... was (Author: rcmuir): i think there is one last problem with this for flex branch, where you have abacadaba\uFFFC, abacadaba\uFFFD and abacadaba\uFFFE in the term dictionary, but a regex the matches say abacadaba[\uFFFC\uFFFE]. in this case, the match on abacadaba\uFFFD will fail, it will try to seek to the "next" string, which is abacadaba\uFFFE, but the FFFE will get replaced by FFFD by the byte conversion, and we will loop. mike i don't think this should be any back compat concern, unlike the high surrogate case which i think many CJK applications are probably doing to... > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: automaton.patch, automatonMultiQuery.patch, > automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, > automatonWithWildCard.patch, automatonWithWildCard2.patch, > BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606_nodep.patch > > > Attached is a patch for an AutomatonQuery/Filter (name can change if its not > suitable). > Whereas the out-of-box contrib RegexQuery is nice, I have some very large > indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. > Additionally all of the existing RegexQuery implementations in Lucene are > really slow if there is no constant prefix. This implementation does not > depend upon constant prefix, and runs the same query in 640ms. > Some use cases I envision: > 1. lexicography/etc on large text corpora > 2. looking for things such as urls where the prefix is not constant (http:// > or ftp://) > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert > regular expressions into a DFA. Then, the filter "enumerates" terms in a > special way, by using the underlying state machine. Here is my short > description from the comments: > The algorithm here is pretty basic. Enumerate terms but instead of a > binary accept/reject do: > > 1. Look at the portion that is OK (did not enter a reject state in the > DFA) > 2. Generate the next possible String and seek to that. > the Query simply wraps the filter with ConstantScoreQuery. > I did not include the automaton.jar inside the patch but it can be downloaded > from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781915#action_12781915 ] Robert Muir commented on LUCENE-1606:
-
Uwe, where do you see UTF-38? :)
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads
[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781916#action_12781916 ] Robert Muir commented on LUCENE-2075:
-
Thanks Mike, that's what I was missing; hitting the terms dict twice in the common case explains it to me :)

> Share the Term -> TermInfo cache across threads
> ---
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch
>
> Right now each thread creates its own (thread-private) SimpleLRUCache, holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads that come through Lucene, you're multiplying the RAM usage. You're also cutting way back on the likelihood of a cache hit (except for the known multiple times we look up a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often, which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg ConcurrentHashMap. One simple approach could be a double-barrel LRU cache, using 2 maps (primary, secondary). You check the cache by first checking primary; if that's a miss, you check secondary, and if you get a hit you promote it to primary. Once primary is full you clear secondary and swap them.
> Or... any other suggested approach?
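The double-barrel scheme described above is compact enough to sketch in Java. Below is a minimal, hypothetical version, not the ConcurrentLRUCache.java attached to the issue; eviction is approximate, since the stale barrel is dropped wholesale once the primary fills:
{code}
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal sketch of a double-barrel LRU cache, assuming the semantics
 * described in the issue; the attached patch may differ. Eviction is
 * approximate: when the primary map fills, the old secondary is
 * dropped wholesale and the barrels swap roles, so recently promoted
 * entries survive the swap.
 */
public class DoubleBarrelCache<K, V> {
  private final int maxSize;
  private volatile ConcurrentHashMap<K, V> primary =
      new ConcurrentHashMap<K, V>();
  private volatile ConcurrentHashMap<K, V> secondary =
      new ConcurrentHashMap<K, V>();

  public DoubleBarrelCache(int maxSize) {
    this.maxSize = maxSize;
  }

  public V get(K key) {
    V v = primary.get(key);
    if (v == null) {
      v = secondary.get(key);
      if (v != null) {
        put(key, v); // promote a secondary hit back into primary
      }
    }
    return v; // may be null: cache miss
  }

  public synchronized void put(K key, V value) {
    if (primary.size() >= maxSize) {
      secondary = primary;                     // full barrel becomes secondary
      primary = new ConcurrentHashMap<K, V>(); // fresh, empty primary
    }
    primary.put(key, value);
  }
}
{code}
Reads stay lock-free on the ConcurrentHashMaps; only put() synchronizes, which matches the intent of sharing one cache across many search threads.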
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781917#action_12781917 ] Michael McCandless commented on LUCENE-1458:
bq. Mike, I guess I mean I'd prefer UCA order, which isn't just the order codepoints happened to randomly appear on charts, but is actually designed for sorting and ordering things
Ahh, gotchya. Well, if we make the sort order pluggable, you could do that...

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
>
> Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, UnicodeTestCase.patch, UnicodeTestCase.patch
>
> I attached a very rough checkpoint of my current patch, to get early feedback. All tests pass, though back-compat tests don't pass due to changes to package-private APIs plus certain bugs in tests that happened to work (eg call TermPositions.nextPosition() too many times, which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such that back-compat tests don't pass, we could go back, make a branch on the back-compat tag, commit changes to the tests to use the new package-private APIs on that branch, then fix the nightly build to use the tip of that branch?]
> There's still plenty to do before this is committable! This is a rather large change:
> * Switches to a new, more efficient terms dict format. This still uses tii/tis files, but the tii only stores term & long offset (not a TermInfo). At seek points, tis encodes term & freq/prox offsets absolutely instead of with deltas. Also, tis/tii are structured by field, so we don't have to record the field number in every term.
> On the first 1M docs of Wikipedia, the tii file is 36% smaller (0.99 MB -> 0.64 MB) and the tis file is 9% smaller (75.5 MB -> 68.5 MB).
> RAM usage when loading the terms dict index is significantly less, since we only load an array of offsets and an array of String (no more TermInfo array). It should be faster to init too.
> This part is basically done.
> * Introduces a modular reader codec that strongly decouples the terms dict from the docs/positions readers. EG there is no more TermInfo used when reading the new format.
> There's nice symmetry now between reading & writing in the codec chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
> * Introduces a new "flex" API for iterating through the fields, terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions. SegmentReader emulates the old API on top of the new API to keep back-compat.
>
> Next steps:
> * Plug in new codecs (pulsing, pfor) to exercise the modularity / fix any hidden assumptions.
> * Expose the new API out of IndexReader, deprecate the old API but emulate it on top of the new one, switch all core/contrib users to the new API.
> * Maybe switch to AttributeSources as the base class for TermsEnum, DocsEnum, PostingsEnum -- this would give readers API flexibility (not just index-file-format flexibility). EG if someone wanted to store payloads at the term-doc level instead of the term-doc-position level, you could just add a new attribute.
> * Test performance & iterate.
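The "absolute at seek points, deltas in between" encoding described above for the tis file is a standard layout: deltas keep the numbers small (so vInts stay short), while the periodic absolute values give a reader valid places to start decoding after a seek. A toy sketch of the idea; the interval and method name are invented for illustration, not taken from the patch:
{code}
import java.util.ArrayList;
import java.util.List;

public class SeekPointEncoding {
  static final int INDEX_INTERVAL = 128; // hypothetical seek-point spacing

  /**
   * Encode ascending offsets: every INDEX_INTERVAL-th entry is stored
   * absolutely (a valid decode start), the rest as small deltas from
   * the previous value.
   */
  static List<Long> encode(long[] offsets) {
    List<Long> out = new ArrayList<Long>(offsets.length);
    for (int i = 0; i < offsets.length; i++) {
      if (i % INDEX_INTERVAL == 0) {
        out.add(offsets[i]);                  // absolute at seek point
      } else {
        out.add(offsets[i] - offsets[i - 1]); // delta elsewhere
      }
    }
    return out;
  }
}
{code}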
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781922#action_12781922 ] Uwe Schindler commented on LUCENE-1606:
---
bq. Uwe, where do you see UTF-38? :)
Patch, line 6025.
[jira] Commented: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order
[ https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781925#action_12781925 ] Michael McCandless commented on LUCENE-2086:
Backported to 3.0.x... 2.9.x next.
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781924#action_12781924 ] Uwe Schindler commented on LUCENE-1606:
---
about the cleanupPrefix method: it is only used in the linear case to initially set the term enum. What happens if the nextString() method returns such a string used to seek the next enum?
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781923#action_12781923 ] Robert Muir commented on LUCENE-1458:
-
bq. Ahh, gotchya. Well, if we make the sort order pluggable, you could do that...
yes, then we could consider getting rid of the Collator/Locale-based range queries / sorts and things like that completely... which have performance problems. you would have a better way to do it... but if you change the sort order, any part of Lucene sensitive to it might break... maybe it's dangerous. maybe if we do it, it needs to be exposed properly, so other components can change their behavior.
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781926#action_12781926 ] Robert Muir commented on LUCENE-1606:
-
bq. about the cleanupPrefix method: it is only used in the linear case to initially set the term enum. What happens if the nextString() method returns such a string used to seek the next enum?
look at the code to nextString() itself. it uses cleanSeek(), which works differently.
when seeking, we can append \uDC00 to achieve the same thing as seeking to a high surrogate.
when using a prefix, we have to truncate the high surrogate, because we cannot use it with TermRef.startsWith() etc; it cannot be converted into UTF-8 bytes. (and we can't use the \uDC00 trick, obviously, or startsWith() will return false when it should not)
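The two treatments of a dangling high surrogate can be sketched directly. The helper names below are invented for illustration; cleanSeek()/cleanupPrefix() in the actual patch may differ in detail:
{code}
public final class SurrogateFixups {

  /**
   * For seeking: complete a trailing lead surrogate with \uDC00, the
   * lowest trail surrogate, so the string is valid UTF-16 and seeks
   * to the first term at or after that code point.
   */
  static String forSeek(String s) {
    int n = s.length();
    if (n > 0 && Character.isHighSurrogate(s.charAt(n - 1))) {
      return s + '\uDC00';
    }
    return s;
  }

  /**
   * For prefix checks: truncate a trailing lead surrogate, since an
   * unpaired surrogate cannot be converted to UTF-8 bytes, and
   * appending \uDC00 would make startsWith() fail when it should not.
   */
  static String forPrefix(String s) {
    int n = s.length();
    if (n > 0 && Character.isHighSurrogate(s.charAt(n - 1))) {
      return s.substring(0, n - 1);
    }
    return s;
  }
}
{code}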
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781927#action_12781927 ] Michael McCandless commented on LUCENE-1458:
Yes, this (customizing the comparator for TermRefs) would definitely be very advanced stuff... you'd have to create your own codec to do it. And we'd default to UTF-16 sort order for back compat.
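The UTF-16 vs. UTF-8 order difference that the back-compat default preserves only shows up for supplementary characters; a self-contained check using nothing but the JDK:
{code}
import java.io.UnsupportedEncodingException;

public class SortOrderDemo {
  public static void main(String[] args) throws UnsupportedEncodingException {
    String bmp = "\uFFFF";        // U+FFFF, the last BMP code point
    String supp = "\uD800\uDC00"; // U+10000 as a surrogate pair

    // UTF-16 code unit order: the lead surrogate 0xD800 < 0xFFFF,
    // so the supplementary string sorts FIRST.
    System.out.println(bmp.compareTo(supp) > 0); // true

    // UTF-8 byte order matches code point order: U+10000 > U+FFFF,
    // so the supplementary string sorts LAST.
    byte[] a = bmp.getBytes("UTF-8");
    byte[] b = supp.getBytes("UTF-8");
    System.out.println(compareUnsigned(a, b) < 0); // true
  }

  static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) return d;
    }
    return a.length - b.length;
  }
}
{code}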
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781926#action_12781926 ] Robert Muir edited comment on LUCENE-1606 at 11/24/09 1:44 PM:
---
bq. about the cleanupPrefix method: it is only used in the linear case to initially set the term enum. What happens if the nextString() method returns such a string used to seek the next enum?
look at the code to nextString() itself. it uses cleanupPosition(), which works differently.
when seeking, we can append \uDC00 to achieve the same thing as seeking to a high surrogate.
when using a prefix, we have to truncate the high surrogate, because we cannot use it with TermRef.startsWith() etc; it cannot be converted into UTF-8 bytes. (and we can't use the \uDC00 trick, obviously, or startsWith() will return false when it should not)

was (Author: rcmuir):
bq. about the cleanupPrefix method: it is only used in the linear case to initially set the term enum. What happens if the nextString() method returns such a string used to seek the next enum?
look at the code to nextString() itself. it uses cleanSeek(), which works differently.
when seeking, we can append \uDC00 to achieve the same thing as seeking to a high surrogate.
when using a prefix, we have to truncate the high surrogate, because we cannot use it with TermRef.startsWith() etc; it cannot be converted into UTF-8 bytes. (and we can't use the \uDC00 trick, obviously, or startsWith() will return false when it should not)
[jira] Resolved: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order
[ https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-2086.
Resolution: Fixed
OK, backported to 2.9.x.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781935#action_12781935 ] Robert Muir commented on LUCENE-1458:
-
bq. Yes, this (customizing the comparator for TermRefs) would definitely be very advanced stuff... you'd have to create your own codec to do it. And we'd default to UTF-16 sort order for back compat.
Agreed, changing the sort order breaks a lot of things (not just some crazy seeking-around code that I write), i.e. if 'ch' is a character in some collator and sorts after 'b', before 'c' (completely made-up example, though there are real ones like this), then even PrefixQuery itself will fail!
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781938#action_12781938 ] Uwe Schindler commented on LUCENE-1458:
---
...not to mention TermRangeQueries and NumericRangeQueries. They rely on String.compareTo, like the current terms dict.
[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781935#action_12781935 ] Robert Muir edited comment on LUCENE-1458 at 11/24/09 2:01 PM:
---
bq. Yes, this (customizing the comparator for TermRefs) would definitely be very advanced stuff... you'd have to create your own codec to do it. And we'd default to UTF-16 sort order for back compat.
Agreed, changing the sort order breaks a lot of things (not just some crazy seeking-around code that I write), i.e. if 'ch' is a character in some collator and sorts after 'b', before 'c' (completely made-up example, though there are real ones like this), then even PrefixQuery itself will fail!
edit: a better example is French collation, where the weight of accent marks is evaluated in reverse order. A prefix query would make assumptions based on the prefix which are wrong.

was (Author: rcmuir):
bq. Yes, this (customizing the comparator for TermRefs) would definitely be very advanced stuff... you'd have to create your own codec to do it. And we'd default to UTF-16 sort order for back compat.
Agreed, changing the sort order breaks a lot of things (not just some crazy seeking-around code that I write), i.e. if 'ch' is a character in some collator and sorts after 'b', before 'c' (completely made-up example, though there are real ones like this), then even PrefixQuery itself will fail!
[jira] Created: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
Prepare CharArraySet for Unicode 4.0
Key: LUCENE-2094
URL: https://issues.apache.org/jira/browse/LUCENE-2094
Project: Lucene - Java
Issue Type: Bug
Components: Analysis
Affects Versions: 2.9.1, 2.9, 2.4.1, 2.4, 2.3.2, 2.3.1, 2.3, 2.2, 2.1, 2.0.0, 1.9, 2.3.3, 2.4.2, 2.9.2, 3.0, 3.0.1, 3.1
Reporter: Simon Willnauer
Fix For: 3.1

CharArraySet does lowercasing if created with the corresponding flag. As a result, a String / char[] containing Unicode 4.0 (supplementary) characters that is in the set cannot be retrieved in "ignorecase" mode.
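The failure mode is easy to reproduce with plain JDK calls: lowercasing char by char can never touch a supplementary character, because Character.toLowerCase(char) does not map lone surrogates. A small demo with DESERET CAPITAL LETTER LONG I (U+10400, which lowercases to U+10428):
{code}
public class SupplementaryLowercaseDemo {
  public static void main(String[] args) {
    char[] s = Character.toChars(0x10400); // surrogate pair \uD801\uDC00

    // Per-char lowercasing (what code unaware of supplementary
    // characters effectively does): both halves are surrogates,
    // so nothing changes.
    String perChar = new String(new char[] {
        Character.toLowerCase(s[0]), Character.toLowerCase(s[1]) });
    System.out.println(perChar.equals(new String(s))); // true: still uppercase

    // Code-point-aware lowercasing maps U+10400 -> U+10428.
    int lower = Character.toLowerCase(Character.codePointAt(s, 0));
    System.out.println(Integer.toHexString(lower)); // 10428
  }
}
{code}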
[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2094:
Attachment: LUCENE-2094.txt
This patch contains a testcase and a fixed CharArraySet. It does not yet use Version, to preserve compatibility. I bring this patch up to start the discussion of how we should handle this particular case. Using Version would not be that much of an issue, as all Analyzers using a CharArraySet already have the Version class.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781947#action_12781947 ] DM Smith commented on LUCENE-1458:
--
bq. Yes, this (customizing the comparator for TermRefs) would definitely be very advanced stuff... you'd have to create your own codec to do it. And we'd default to UTF-16 sort order for back compat.
For those of us working on texts in all different kinds of languages, it should not be very advanced stuff. It should be stock Lucene. A default UCA comparator would be good. And a way to provide a locale-sensitive UCA comparator would also be good. My use case is that each Lucene index typically has a single language, or at least a dominant language.
bq. ...not to mention TermRangeQueries and NumericRangeQueries. They rely on String.compareTo, like the current terms dict.
I think that String.compareTo works correctly on UCA collation keys.
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781950#action_12781950 ] Robert Muir commented on LUCENE-2094:
-
Hi Simon, at a glance your patch is OK. I wonder, though, if we should try to consistently improve both this and the LowerCaseFilter patch in the same way. I have two ideas that might make it easier...? I am very inconsistent with these things myself, so I guess we can try to make it consistent.
1. I wonder if loops like this:
{code}
for (int i = 0; i < len; i++) {
  ...
  if (codepoint >= Character.MIN_SUPPLEMENTARY_CODE_POINT) {
    ++i;
  }
}
{code}
should instead look like:
{code}
for (int i = 0; i < len; ) {
  ...
  i += Character.charCount(codepoint);
}
{code}
2. I wonder if we even need an if (supplementary) check for things like lowercasing. toLowerCase(ch) and toLowerCase(int) are most likely the same code anyway, so we could just make the code easier to read:
{code}
for (int i = 0; i < len; ) {
  i += Character.toChars(
      Character.toLowerCase(Character.codePointAt(arr, i, len)), arr, i);
}
{code}
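Idea (2) assembled into a runnable helper; this is a sketch only, not the patch's code, and it assumes simple case mappings, where lowercasing never changes the number of chars:
{code}
public final class CharArrayUtil {
  /**
   * Lowercase a char[] region in place, handling supplementary code
   * points; assumes simple (length-preserving) case mappings.
   */
  static void toLowerCase(char[] arr, int offset, int limit) {
    for (int i = offset; i < limit; ) {
      int cp = Character.codePointAt(arr, i, limit);
      // write the lowered code point back and advance by its char count
      i += Character.toChars(Character.toLowerCase(cp), arr, i);
    }
  }

  public static void main(String[] args) {
    char[] buf = ("AB" + new String(Character.toChars(0x10400))).toCharArray();
    toLowerCase(buf, 0, buf.length);
    System.out.println(new String(buf)); // "ab" followed by U+10428
  }
}
{code}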
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781953#action_12781953 ] Robert Muir commented on LUCENE-1458:
-
bq. I think that String.compareTo works correctly on UCA collation keys.
No, because UCA collation keys are bytes :) You are right that byte comparison on these keys works, though. But if we change the sort order like this, various components are not looking at keys; instead they are looking at the term text itself. I guess what I am saying is that there are a lot of assumptions in Lucene right now (PrefixQuery was my example) that look at term text and assume it is sorted in binary order.
bq. It should be stock Lucene
As much as I agree with you that default UCA should be "stock Lucene" (with the capability to use an alternate locale or even a tailored collator), this creates some practical problems, as mentioned above. Also the practical problem that collation in the JDK is poop and we would want ICU for good performance...
> * Introduces modular reader codec that strongly decouples terms dict > from docs/positions readers. EG there is no more TermInfo used > when reading the new format. > . > There's nice symmetry now between reading & writing in the codec > chain -- the current docs/prox format is captured in: > {code} > FormatPostingsTermsDictWriter/Reader > FormatPostingsDocsWriter/Reader (.frq file) and > FormatPostingsPositionsWriter/Reader (.prx file). > {code} > This part is basically done. > * Introduces a new "flex" API for iterating through the fields, > terms, docs and positions: > {code} > FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum > {code} > This replaces TermEnum/Docs/Positions. SegmentReader emulates the > old API on top of the new API to keep back-compat. > > Next steps: > * Plug in new codecs (pulsing, pfor) to exercise the modularity / > fix any hidden assumptions. > * Expose new API out of IndexReader, deprecate old API but emulate > old API on top of new one, switch all core/contrib users to the > new API. > * Maybe switch to AttributeSources as the base class for TermsEnum, > DocsEnum, PostingsEnum -- this would give readers API flexib
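To make the byte-comparison point above concrete, here is a minimal, self-contained sketch using the JDK's java.text.Collator. It is illustrative only -- it is not Lucene code -- but it shows why raw byte order on UCA keys agrees with collator order, while String.compareTo on the term text itself can disagree:

{code}
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

public class CollationKeyOrder {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.FRENCH);
        CollationKey a = collator.getCollationKey("côte");
        CollationKey b = collator.getCollationKey("coter");

        // CollationKey.compareTo is defined to agree with an unsigned
        // byte-by-byte comparison of toByteArray(), so byte order on the
        // keys is safe.
        System.out.println("collation key order: " + Integer.signum(a.compareTo(b)));            // -1
        // Binary (UTF-16) order on the raw term text disagrees: the second
        // char compares 'ô' (U+00F4) against 'o' (U+006F) and wins.
        System.out.println("binary term order:   " + Integer.signum("côte".compareTo("coter"))); // 1
    }
}
{code}

Any component that inspects the term text directly (as prefixquery does) sees the binary order, which is exactly the mismatch described above.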
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781959#action_12781959 ] Simon Willnauer commented on LUCENE-2094: - Robert, I tried to make it consistent with the LowerCaseFilter issue, but I would vote +1 for both! This makes it much cleaner, but we need to change the LowerCaseFilter one too! I will quickly change my patch. > Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.txt > > > CharArraySet does lowercaseing if created with the correspondent flag. This > causes that String / char[] with uncode 4 chars which are in the set can not > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781958#action_12781958 ] Uwe Schindler commented on LUCENE-2094: --- Maybe we put this into UnicodeUtils (handling of toLowerCase etc for char[]). > Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.txt > > > CharArraySet does lowercaseing if created with the correspondent flag. This > causes that String / char[] with uncode 4 chars which are in the set can not > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781960#action_12781960 ] Simon Willnauer commented on LUCENE-2094: - bq. Maybe we put this into UnicodeUtils (handling of toLowerCase etc for char[]). I think calling those 3 methods should be fine without a utils method. We will see how it goes until the "end" of this whole issue - I might change my mind. simon > Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.txt > > > CharArraySet does lowercaseing if created with the correspondent flag. This > causes that String / char[] with uncode 4 chars which are in the set can not > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781962#action_12781962 ] Robert Muir commented on LUCENE-2094: - Simon definitely, it is not a problem with your patch... Thinking we can fix both to be clean. btw, I have no idea if there is any performance difference between doing things this way. > Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.txt > > > CharArraySet does lowercaseing if created with the correspondent flag. This > causes that String / char[] with uncode 4 chars which are in the set can not > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2094: Attachment: LUCENE-2094.txt Changed loop to use Character.charCount() > Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.txt, LUCENE-2094.txt > > > CharArraySet does lowercaseing if created with the correspondent flag. This > causes that String / char[] with uncode 4 chars which are in the set can not > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
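For readers following along, the Character.charCount() idiom looks roughly like the following. This is a sketch of the general pattern, under the assumption that the patch iterates code points rather than chars -- it is not the attached patch itself:

{code}
// Code-point-aware lowercasing over a char[] slice. Character.toChars
// returns how many chars it wrote (1 for BMP, 2 for supplementary code
// points), which equals Character.charCount(lower) and advances the loop.
static void toLowerCase(char[] buffer, int offset, int limit) {
    for (int i = offset; i < limit; ) {
        int codePoint = Character.codePointAt(buffer, i, limit);
        // Assumes the JDK's 1:1 code point case mapping, so the lowered
        // code point occupies the same number of chars as the original.
        i += Character.toChars(Character.toLowerCase(codePoint), buffer, i);
    }
}
{code}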
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781964#action_12781964 ] Simon Willnauer commented on LUCENE-2094: - bq. btw, I have no idea if there is any performance difference between doing things this way. The change to charCount is pretty much the same as the if statement - this at least would not kill any performance. The increment by 2 should also not be an issue. it is slightly slower than a ++ but this will be fine I guess. > Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.txt, LUCENE-2094.txt > > > CharArraySet does lowercaseing if created with the correspondent flag. This > causes that String / char[] with uncode 4 chars which are in the set can not > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781965#action_12781965 ] Robert Muir commented on LUCENE-2094: - Simon, yeah, I guess what I don't know is if in the JDK Character.foo(int) is the same underlying stuff as Character.foo(char). In trunk ICU there are no char-based methods at all; it is all int, where it's a trie lookup, with a special fast-path array for linear access to Latin-1 > Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.txt, LUCENE-2094.txt > > > CharArraySet does lowercaseing if created with the correspondent flag. This > causes that String / char[] with uncode 4 chars which are in the set can not > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781969#action_12781969 ] Simon Willnauer commented on LUCENE-2094: - bq. I guess what I don't know, is if in the JDK Character.foo(int) is the same underlying stuff as Character.foo(char) The JDK version of toLowerCase(char) for instance casts to int and calls the overloaded method. public static boolean isLowerCase(char ch) { return isLowerCase((int)ch); } That is the case all over the place as far as I can see. > Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.txt, LUCENE-2094.txt > > > CharArraySet does lowercaseing if created with the correspondent flag. This > causes that String / char[] with uncode 4 chars which are in the set can not > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781969#action_12781969 ] Simon Willnauer edited comment on LUCENE-2094 at 11/24/09 3:00 PM: --- bq. I guess what I don't know, is if in the JDK Character.foo(int) is the same underlying stuff as Character.foo(char) The JDK version of toLowerCase(char) for instance casts to int and calls the overloaded method. {code} public static boolean isLowerCase(char ch) { return isLowerCase((int)ch); } {code} That is the case all over the place as far as I can see. was (Author: simonw): bq. I guess what I don't know, is if in the JDK Character.foo(int) is the same underlying stuff as Character.foo(char) The JDK version of toLowerCase(char) for instance casts to int and calls the overloaded method. public static boolean isLowerCase(char ch) { return isLowerCase((int)ch); } That is the case all over the place as far as I can see. > Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.txt, LUCENE-2094.txt > > > CharArraySet does lowercaseing if created with the correspondent flag. This > causes that String / char[] with uncode 4 chars which are in the set can not > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781971#action_12781971 ] Robert Muir commented on LUCENE-2094: - Simon, yeah i just checked. all the properties, behind the scenes are stored as int. we shouldn't use any char-based methods pretending it will buy us any faster performance. it will just make the code ugly and probably slower. > Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.txt, LUCENE-2094.txt > > > CharArraySet does lowercaseing if created with the correspondent flag. This > causes that String / char[] with uncode 4 chars which are in the set can not > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781971#action_12781971 ] Robert Muir edited comment on LUCENE-2094 at 11/24/09 3:09 PM: --- Simon, yeah i just checked. all the properties, behind the scenes are stored as int. we shouldn't use any char-based methods pretending it will buy us any faster performance. it will just make the code ugly and probably slower. slower meaning, the "if" itself in the lowercasefilter patch, it can now be removed. was (Author: rcmuir): Simon, yeah i just checked. all the properties, behind the scenes are stored as int. we shouldn't use any char-based methods pretending it will buy us any faster performance. it will just make the code ugly and probably slower. > Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.txt, LUCENE-2094.txt > > > CharArraySet does lowercaseing if created with the correspondent flag. This > causes that String / char[] with uncode 4 chars which are in the set can not > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2094: Attachment: LUCENE-2094.txt Added some more tests, including single high-surrogate chars. > Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt > > > CharArraySet does lowercaseing if created with the correspondent flag. This > causes that String / char[] with uncode 4 chars which are in the set can not > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
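A hypothetical test along the lines described (a supplementary pair plus a lone high surrogate) might look like this; the constructor and assertions in the actual patch may differ:

{code}
import org.apache.lucene.analysis.CharArraySet;

public class SupplementaryCharArraySetDemo {
    public static void main(String[] args) {
        CharArraySet set = new CharArraySet(8, /* ignoreCase= */ true);
        // U+10400 / U+10428: DESERET LETTER LONG I, upper and lower case.
        String upper = new String(Character.toChars(0x10400));
        String lower = new String(Character.toChars(0x10428));
        set.add(upper);
        System.out.println(set.contains(lower));    // expected true after the fix
        // A single (unpaired) high surrogate must not match or throw.
        System.out.println(set.contains("\uD801")); // expected false
    }
}
{code}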
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781980#action_12781980 ] Simon Willnauer commented on LUCENE-2094: - question of the day - should we use Version or not :) > Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt > > > CharArraySet does lowercaseing if created with the correspondent flag. This > causes that String / char[] with uncode 4 chars which are in the set can not > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781983#action_12781983 ] Uwe Schindler commented on LUCENE-2094: --- It would not hurt, the Set is only used for analyzers that all take a version param... It is not really a public API. > Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt > > > CharArraySet does lowercaseing if created with the correspondent flag. This > causes that String / char[] with uncode 4 chars which are in the set can not > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-2092) BooleanQuery.hashCode and equals ignore isCoordDisabled
[ https://issues.apache.org/jira/browse/LUCENE-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-2092. Resolution: Fixed Fixed in trunk, 3.0.x branch, 2.9.x branch. Thanks Hoss! > BooleanQuery.hashCode and equals ignore isCoordDisabled > --- > > Key: LUCENE-2092 > URL: https://issues.apache.org/jira/browse/LUCENE-2092 > Project: Lucene - Java > Issue Type: Bug > Components: Query/Scoring >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1, > 2.9, 2.9.1 >Reporter: Hoss Man >Assignee: Michael McCandless > Fix For: 2.9.2, 3.0.1, 3.1 > > Attachments: LUCENE-2092.patch > > > BooleanQuery.isCoordDisabled() is not considered by BooleanQuery's hashCode() > or equals() methods ... this can cause serious badness to happen when caching > BooleanQueries. > bug traces back to at least 1.9 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
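The bug class here is worth spelling out: any flag that changes scoring semantics has to participate in equals()/hashCode(), or caches keyed on the query will conflate distinct queries. A minimal sketch of the shape of such a fix (not the actual LUCENE-2092 patch):

{code}
class ExampleQuery {
    private final String term;
    private final boolean disableCoord; // scoring-relevant, so it must be compared

    ExampleQuery(String term, boolean disableCoord) {
        this.term = term;
        this.disableCoord = disableCoord;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ExampleQuery)) return false;
        ExampleQuery other = (ExampleQuery) o;
        return term.equals(other.term) && disableCoord == other.disableCoord;
    }

    @Override
    public int hashCode() {
        return 31 * term.hashCode() + (disableCoord ? 1 : 0);
    }
}
{code}

Without the disableCoord term, two BooleanQuery-like objects that score differently would hash and compare as equal, and a query cache would happily return results computed for the other one.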
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781994#action_12781994 ] Simon Willnauer commented on LUCENE-2094: - bq. It would not hurt, the Set is only used for analyzers that all take a version param... It is not really a public API. So the thing here is that lowercasing for supplementary characters only applies to a handful of chars; see this link http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3ACase_Sensitive%3DTrue%3A]%26[^[\u-\u]]]&esc=on Those characters are from the Deseret Alphabet (Mormons), which means we would be introducing a "pain in the neck" Version flag into CharArraySet for about 40 chars that would be broken?! I don't see this here! Nothing personal against the Deseret Alphabet or anyone who is using it, but this seems a bit too much of a hassle. It would make the code very ugly though. simon > Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt > > > CharArraySet does lowercaseing if created with the correspondent flag. This > causes that String / char[] with uncode 4 chars which are in the set can not > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781995#action_12781995 ] Robert Muir commented on LUCENE-2094: - Another option would be to list a back-compat break in CHANGES: if you are indexing the Deseret language, you should reindex. We could remove the Version from LowerCaseFilter this way, too. If you are indexing this language, things weren't working right before so you surely wrote your own filters...?! > Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt > > > CharArraySet does lowercaseing if created with the correspondent flag. This > causes that String / char[] with uncode 4 chars which are in the set can not > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781998#action_12781998 ] Simon Willnauer commented on LUCENE-2094: - I would also break compat in LowerCaseFilter and bring out a large NOTE that if you index mormon you need to reindex. > Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt > > > CharArraySet does lowercaseing if created with the correspondent flag. This > causes that String / char[] with uncode 4 chars which are in the set can not > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782007#action_12782007 ] Uwe Schindler commented on LUCENE-2094: --- +1 for breaking backwards compatibility for these chars. From the web: there are only 4 books written in this charset (the books of mormon, see [http://en.wikipedia.org/wiki/Deseret_alphabet], [http://www.omniglot.com/writing/deseret.htm]), so it is rather rare. People affected by this will for sure have their own analyzers. > Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt > > > CharArraySet does lowercaseing if created with the correspondent flag. This > causes that String / char[] with uncode 4 chars which are in the set can not > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782009#action_12782009 ] Robert Muir commented on LUCENE-2094: - Simon, yeah. It's tricky, you know, like many suppl. char issues. Even if we provide perfect backwards compatibility with what 3.0 did, if you care about these languages, you *WANT* to reindex, because stuff wasn't working at all before. And if you really care, you weren't using any of lucene's analysis components anyway (except maybe WhitespaceTokenizer). For example, StandardAnalyzer currently discards these characters anyway. But we don't want to screw over CJK users where things might have been "mostly" working before, either. In this case, CJK is completely unaffected, so I think we should not use Version here or in any other lowercasing fixes, including LowerCaseFilter itself. > Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt > > > CharArraySet does lowercaseing if created with the correspondent flag. This > causes that String / char[] with uncode 4 chars which are in the set can not > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer reassigned LUCENE-2039: --- Assignee: Simon Willnauer (was: Grant Ingersoll) Took over from Grant > Regex support and beyond in JavaCC QueryParser > -- > > Key: LUCENE-2039 > URL: https://issues.apache.org/jira/browse/LUCENE-2039 > Project: Lucene - Java > Issue Type: Improvement > Components: QueryParser >Reporter: Simon Willnauer >Assignee: Simon Willnauer >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2039.patch, LUCENE-2039_field_ext.patch, > LUCENE-2039_field_ext.patch, LUCENE-2039_field_ext.patch > > > Since the early days the standard query parser was limited to the queries > living in core; adding other queries or extending the parser in any way > always forced people to change the grammar file and regenerate. Even if you > change the grammar you have to be extremely careful how you modify the parser > so that other parts of the standard parser are not affected by customisation > changes. Eventually you had to live with all the limitations the current > parser has, like tokenizing on whitespace before a tokenizer / analyzer has > the chance to look at the tokens. > I was thinking about how to overcome the limitation and add regex support to > the query parser without introducing any dependency to core. I added a new > special character that basically prevents the parser from interpreting any of > the characters enclosed in the new special characters. I chose the forward > slash '/' as the delimiter so that everything in between two forward slashes > is basically escaped and ignored by the parser. All chars embedded within > forward slashes are treated as one token even if the token contains other special > chars like * []?{} or whitespace. This token is subsequently passed to a > pluggable "parser extension" which builds a query from the embedded string. I > do not interpret the embedded string in any way but leave all the subsequent > work to the parser extension. Such an extension could be another full > featured query parser itself or simply a ctor call for a regex query. The > interface remains quite simple but makes the parser extendible in an easy way > compared to modifying the javaCC sources. > The downside of this patch is clearly that I introduce a new special char > into the syntax, but I guess that would not be that much of a deal as it is > reflected in the escape method though. It would truly be nice to have more > than one extension and have this even more flexible, so treat this patch as a > kickoff though. > Another way of solving the problem with RegexQuery would be to move the JDK > version of regex into the core and simply have another method like: > {code} > protected Query newRegexQuery(Term t) { > ... > } > {code} > which I would like better as it would be more consistent with the idea of the > query parser being a very strict and defined parser. > I will upload a patch in a second which implements the extension-based > approach; I guess I will add a second patch with regex in core soon too. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
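To illustrate the delimiter idea (a toy sketch with hypothetical names, not the patch's API): the parser only needs to find the span between two forward slashes and hand it to the extension untouched, special characters and all:

{code}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SlashTokenDemo {
    // Everything between two forward slashes is one opaque token. A real
    // grammar would also handle escaped slashes inside the token.
    private static final Pattern SLASH_TOKEN = Pattern.compile("/([^/]*)/");

    public static void main(String[] args) {
        Matcher m = SLASH_TOKEN.matcher("title:/fo[o]? ba*r/ AND body:plain");
        if (m.find()) {
            // The parser never interprets the embedded string; a pluggable
            // extension (e.g. a regex query factory) receives it verbatim.
            System.out.println("extension receives: " + m.group(1)); // fo[o]? ba*r
        }
    }
}
{code}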
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782015#action_12782015 ] Robert Muir commented on LUCENE-1458: - {quote} So this is definitely a back compat problem. And, unfortunately, even if we like the true codepoint sort order, it's not easy to switch to in a back-compat manner because if we write new segments into an old index, SegmentMerger will be in big trouble when it tries to merge two segments that had sorted the terms differently. {quote} Mike, I think it goes well beyond this. I think sort order is an exceptional low-level case that can trickle all the way up high into the application layer (including user perception itself), and create bugs. Does a non-technical user in Hong Kong know how many codepoints each ideograph they enter are? Should they care? They will just not understand if things are in different order. I think we are stuck with UTF-16 without a huge effort, which would not be worth it in any case. > Further steps towards flexible indexing > --- > > Key: LUCENE-1458 > URL: https://issues.apache.org/jira/browse/LUCENE-1458 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Affects Versions: 2.9 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Attachments: LUCENE-1458-back-compat.patch, > LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, > LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, > LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, > LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, > LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, > LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, > UnicodeTestCase.patch, UnicodeTestCase.patch > > > I attached a very rough checkpoint of my current patch, to get early > feedback. All tests pass, though back compat tests don't pass due to > changes to package-private APIs plus certain bugs in tests that > happened to work (eg call TermPostions.nextPosition() too many times, > which the new API asserts against). > [Aside: I think, when we commit changes to package-private APIs such > that back-compat tests don't pass, we could go back, make a branch on > the back-compat tag, commit changes to the tests to use the new > package private APIs on that branch, then fix nightly build to use the > tip of that branch?o] > There's still plenty to do before this is committable! This is a > rather large change: > * Switches to a new more efficient terms dict format. This still > uses tii/tis files, but the tii only stores term & long offset > (not a TermInfo). At seek points, tis encodes term & freq/prox > offsets absolutely instead of with deltas delta. Also, tis/tii > are structured by field, so we don't have to record field number > in every term. > . > On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB > -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB). > . > RAM usage when loading terms dict index is significantly less > since we only load an array of offsets and an array of String (no > more TermInfo array). It should be faster to init too. > . > This part is basically done. 
> * Introduces modular reader codec that strongly decouples terms dict > from docs/positions readers. EG there is no more TermInfo used > when reading the new format. > . > There's nice symmetry now between reading & writing in the codec > chain -- the current docs/prox format is captured in: > {code} > FormatPostingsTermsDictWriter/Reader > FormatPostingsDocsWriter/Reader (.frq file) and > FormatPostingsPositionsWriter/Reader (.prx file). > {code} > This part is basically done. > * Introduces a new "flex" API for iterating through the fields, > terms, docs and positions: > {code} > FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum > {code} > This replaces TermEnum/Docs/Positions. SegmentReader emulates the > old API on top of the new API to keep back-compat. > > Next steps: > * Plug in new codecs (pulsing, pfor) to exercise the modularity / > fix any hidden assumptions. > * Expose new API out of IndexReader, deprecate old API but emulate > old API on top of new one, switch all core/contrib users to the > new API. > * Maybe switch to AttributeSources as the base class for TermsEnum, > DocsEnum, PostingsEnum -- this would give readers API flexibility > (
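The back-compat trap is easy to demonstrate. In UTF-16 code unit order (what String.compareTo, and hence the current term dictionary, uses), a supplementary character sorts below parts of the BMP even though its code point is higher -- a small demo, independent of Lucene:

{code}
public class SortOrderDemo {
    public static void main(String[] args) {
        // U+10400 is stored as the surrogate pair \uD801\uDC00.
        String supplementary = new String(Character.toChars(0x10400));
        String bmp = "\uFF41"; // U+FF41 FULLWIDTH LATIN SMALL LETTER A

        // UTF-16 code unit order: 0xD801 < 0xFF41, so supplementary sorts first.
        System.out.println(supplementary.compareTo(bmp) < 0);                  // true
        // Code point order: 0x10400 > 0xFF41 -- the opposite result.
        System.out.println(supplementary.codePointAt(0) > bmp.codePointAt(0)); // true
    }
}
{code}

Two segments whose terms were sorted by these two different rules cannot be merged without re-sorting one of them, which is the SegmentMerger problem quoted above.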
[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782015#action_12782015 ] Robert Muir edited comment on LUCENE-1458 at 11/24/09 4:37 PM: --- {quote} So this is definitely a back compat problem. And, unfortunately, even if we like the true codepoint sort order, it's not easy to switch to in a back-compat manner because if we write new segments into an old index, SegmentMerger will be in big trouble when it tries to merge two segments that had sorted the terms differently. {quote} Mike, I think it goes well beyond this. I think sort order is an exceptional low-level case that can trickle all the way up high into the application layer (including user perception itself), and create bugs. Does a non-technical user in Hong Kong know how many code units each ideograph they enter are? Should they care? They will just not understand if things are in different order. I think we are stuck with UTF-16 without a huge effort, which would not be worth it in any case. was (Author: rcmuir): {quote} So this is definitely a back compat problem. And, unfortunately, even if we like the true codepoint sort order, it's not easy to switch to in a back-compat manner because if we write new segments into an old index, SegmentMerger will be in big trouble when it tries to merge two segments that had sorted the terms differently. {quote} Mike, I think it goes well beyond this. I think sort order is an exceptional low-level case that can trickle all the way up high into the application layer (including user perception itself), and create bugs. Does a non-technical user in Hong Kong know how many codepoints each ideograph they enter are? Should they care? They will just not understand if things are in different order. I think we are stuck with UTF-16 without a huge effort, which would not be worth it in any case. > Further steps towards flexible indexing > --- > > Key: LUCENE-1458 > URL: https://issues.apache.org/jira/browse/LUCENE-1458 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Affects Versions: 2.9 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Attachments: LUCENE-1458-back-compat.patch, > LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, > LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, > LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, > LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, > LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, > LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, > UnicodeTestCase.patch, UnicodeTestCase.patch > > > I attached a very rough checkpoint of my current patch, to get early > feedback. All tests pass, though back compat tests don't pass due to > changes to package-private APIs plus certain bugs in tests that > happened to work (eg call TermPostions.nextPosition() too many times, > which the new API asserts against). 
> [Aside: I think, when we commit changes to package-private APIs such > that back-compat tests don't pass, we could go back, make a branch on > the back-compat tag, commit changes to the tests to use the new > package private APIs on that branch, then fix nightly build to use the > tip of that branch?o] > There's still plenty to do before this is committable! This is a > rather large change: > * Switches to a new more efficient terms dict format. This still > uses tii/tis files, but the tii only stores term & long offset > (not a TermInfo). At seek points, tis encodes term & freq/prox > offsets absolutely instead of with deltas delta. Also, tis/tii > are structured by field, so we don't have to record field number > in every term. > . > On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB > -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB). > . > RAM usage when loading terms dict index is significantly less > since we only load an array of offsets and an array of String (no > more TermInfo array). It should be faster to init too. > . > This part is basically done. > * Introduces modular reader codec that strongly decouples terms dict > from docs/positions readers. EG there is no more TermInfo used > when reading the new format. > . > There's nice symmetry now between reading & writing in the codec > chain -- the current docs/prox format is captured in: > {code} >
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782024#action_12782024 ] Robert Muir commented on LUCENE-1606: - bq. Patch line 6025. Thanks for reviewing the patch and catching this. I'm working on trying to finalize this. It already works fine for trunk, but I don't want it to suddenly break with the flex branch, so I'm adding a lot of tests and improvements in that regard. The current wildcard tests aren't sufficient anyway to tell if it's really working. Also, when Mike ported it to the flex branch, he reorganized some code in a way that I think is better, so I want to tie that in too. > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: automaton.patch, automatonMultiQuery.patch, > automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, > automatonWithWildCard.patch, automatonWithWildCard2.patch, > BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606_nodep.patch > > > Attached is a patch for an AutomatonQuery/Filter (name can change if its not > suitable). > Whereas the out-of-box contrib RegexQuery is nice, I have some very large > indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. > Additionally all of the existing RegexQuery implementations in Lucene are > really slow if there is no constant prefix. This implementation does not > depend upon constant prefix, and runs the same query in 640ms. > Some use cases I envision: > 1. lexicography/etc on large text corpora > 2. looking for things such as urls where the prefix is not constant (http:// > or ftp://) > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert > regular expressions into a DFA. Then, the filter "enumerates" terms in a > special way, by using the underlying state machine. Here is my short > description from the comments: > The algorithm here is pretty basic. Enumerate terms but instead of a > binary accept/reject do: > > 1. Look at the portion that is OK (did not enter a reject state in the > DFA) > 2. Generate the next possible String and seek to that. > the Query simply wraps the filter with ConstantScoreQuery. > I did not include the automaton.jar inside the patch but it can be downloaded > from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
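The two-step algorithm in the description can be shown with a runnable toy: a sorted in-memory "term dictionary" stands in for Lucene's term enumeration, and ceiling() stands in for a seek. The real patch derives the "next possible String" from the Brics DFA; here it is faked with the constant prefix just to show the seek-versus-scan structure:

{code}
import java.util.Collections;
import java.util.TreeSet;
import java.util.regex.Pattern;

public class AutomatonEnumToy {
    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<String>();
        Collections.addAll(terms, "apple", "http://a", "http://b", "ipod", "zebra");
        Pattern dfa = Pattern.compile("http://.*");

        String term = terms.first();
        while (term != null) {
            if (dfa.matcher(term).matches()) {
                System.out.println("match: " + term);
                term = terms.higher(term);   // cheap sequential next()
            } else {
                // A real implementation asks the DFA for the next string it
                // could possibly accept and seeks there, skipping whole runs
                // of the term dictionary; we approximate with the prefix.
                String next = term.compareTo("http://") < 0 ? "http://" : null;
                term = (next == null) ? null : terms.ceiling(next);
            }
        }
    }
}
{code}

Over 100M+ unique tokens it is this skipping, rather than the per-term match itself, that presumably accounts for the 2 minutes vs. 640ms difference quoted in the description.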
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782026#action_12782026 ] Uwe Schindler commented on LUCENE-1606: --- Did he change the FilteredTermEnum.next() loops? If yes, maybe the better approach also works for NRQ. I am just interested, but had no time to thoroughly look into the latest changes. > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: automaton.patch, automatonMultiQuery.patch, > automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, > automatonWithWildCard.patch, automatonWithWildCard2.patch, > BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606_nodep.patch > > > Attached is a patch for an AutomatonQuery/Filter (name can change if its not > suitable). > Whereas the out-of-box contrib RegexQuery is nice, I have some very large > indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. > Additionally all of the existing RegexQuery implementations in Lucene are > really slow if there is no constant prefix. This implementation does not > depend upon constant prefix, and runs the same query in 640ms. > Some use cases I envision: > 1. lexicography/etc on large text corpora > 2. looking for things such as urls where the prefix is not constant (http:// > or ftp://) > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert > regular expressions into a DFA. Then, the filter "enumerates" terms in a > special way, by using the underlying state machine. Here is my short > description from the comments: > The algorithm here is pretty basic. Enumerate terms but instead of a > binary accept/reject do: > > 1. Look at the portion that is OK (did not enter a reject state in the > DFA) > 2. Generate the next possible String and seek to that. > the Query simply wraps the filter with ConstantScoreQuery. > I did not include the automaton.jar inside the patch but it can be downloaded > from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782026#action_12782026 ] Uwe Schindler edited comment on LUCENE-1606 at 11/24/09 5:09 PM: - Did he change the FilteredTermEnum.next() loops? If yes, maybe the better approach also works for NRQ. I am just interested, but had no time to thoroughly look into the latest changes. I am still thinking about an extension of FilteredTermEnum that works with this repositioning out of the box. But I have no good idea. The work in FilteredTerm*s*Enum is a good start, but may be extended to also support something like a return value "JUMP_TO_NEXT_ENUM" and an abstract method "nextEnum()" that returns null by default (no further enum). was (Author: thetaphi): Did he changed the FilteredTermEnum.next() loops? if yes, maybe the better approach also works for NRQ. I am just interested, but had no time to thoroughly look into the latest changes. > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: automaton.patch, automatonMultiQuery.patch, > automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, > automatonWithWildCard.patch, automatonWithWildCard2.patch, > BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606_nodep.patch > > > Attached is a patch for an AutomatonQuery/Filter (name can change if its not > suitable). > Whereas the out-of-box contrib RegexQuery is nice, I have some very large > indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. > Additionally all of the existing RegexQuery implementations in Lucene are > really slow if there is no constant prefix. This implementation does not > depend upon constant prefix, and runs the same query in 640ms. > Some use cases I envision: > 1. lexicography/etc on large text corpora > 2. looking for things such as urls where the prefix is not constant (http:// > or ftp://) > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert > regular expressions into a DFA. Then, the filter "enumerates" terms in a > special way, by using the underlying state machine. Here is my short > description from the comments: > The algorithm here is pretty basic. Enumerate terms but instead of a > binary accept/reject do: > > 1. Look at the portion that is OK (did not enter a reject state in the > DFA) > 2. Generate the next possible String and seek to that. > the Query simply wraps the filter with ConstantScoreQuery. > I did not include the automaton.jar inside the patch but it can be downloaded > from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782029#action_12782029 ] Robert Muir commented on LUCENE-1606: - No, the main thing he did here that I like better is that, instead of caching the last comparison in termCompare(), he uses a boolean 'first'. This still gives the optimization of 'don't seek in the term dictionary unless you get a mismatch; as long as you have matches, read sequentially'. But in my opinion, it's cleaner. > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: automaton.patch, automatonMultiQuery.patch, > automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, > automatonWithWildCard.patch, automatonWithWildCard2.patch, > BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606_nodep.patch > > > Attached is a patch for an AutomatonQuery/Filter (name can change if its not > suitable). > Whereas the out-of-box contrib RegexQuery is nice, I have some very large > indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. > Additionally all of the existing RegexQuery implementations in Lucene are > really slow if there is no constant prefix. This implementation does not > depend upon constant prefix, and runs the same query in 640ms. > Some use cases I envision: > 1. lexicography/etc on large text corpora > 2. looking for things such as urls where the prefix is not constant (http:// > or ftp://) > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert > regular expressions into a DFA. Then, the filter "enumerates" terms in a > special way, by using the underlying state machine. Here is my short > description from the comments: > The algorithm here is pretty basic. Enumerate terms but instead of a > binary accept/reject do: > > 1. Look at the portion that is OK (did not enter a reject state in the > DFA) > 2. Generate the next possible String and seek to that. > the Query simply wraps the filter with ConstantScoreQuery. > I did not include the automaton.jar inside the patch but it can be downloaded > from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
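The 'first' flag pattern described here reduces to something like the following (a hypothetical shape, not the flex-branch code): seek once up front, then stay on the cheap sequential path until a mismatch forces a jump:

{code}
abstract class SeekingTermsEnum {
    private boolean first = true;

    /** Returns the next accepted term, or null when exhausted. */
    final String next() throws java.io.IOException {
        String term = first ? seekTo(initialSeekTerm()) : readNextSequentially();
        first = false;
        while (term != null && !accept(term)) {
            // Only a mismatch triggers a seek; matches stream sequentially.
            String target = nextSeekTarget(term);
            term = (target == null) ? null : seekTo(target);
        }
        return term;
    }

    protected abstract String initialSeekTerm();
    protected abstract boolean accept(String term);
    protected abstract String nextSeekTarget(String mismatchedTerm);
    protected abstract String seekTo(String target) throws java.io.IOException;
    protected abstract String readNextSequentially() throws java.io.IOException;
}
{code}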
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782033#action_12782033 ] Uwe Schindler commented on LUCENE-1606: --- OK, so doesn't affect NRQ, as it uses a different algo > Automaton Query/Filter (scalable regex) > --- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: automaton.patch, automatonMultiQuery.patch, > automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, > automatonWithWildCard.patch, automatonWithWildCard2.patch, > BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606_nodep.patch > > > Attached is a patch for an AutomatonQuery/Filter (name can change if its not > suitable). > Whereas the out-of-box contrib RegexQuery is nice, I have some very large > indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. > Additionally all of the existing RegexQuery implementations in Lucene are > really slow if there is no constant prefix. This implementation does not > depend upon constant prefix, and runs the same query in 640ms. > Some use cases I envision: > 1. lexicography/etc on large text corpora > 2. looking for things such as urls where the prefix is not constant (http:// > or ftp://) > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert > regular expressions into a DFA. Then, the filter "enumerates" terms in a > special way, by using the underlying state machine. Here is my short > description from the comments: > The algorithm here is pretty basic. Enumerate terms but instead of a > binary accept/reject do: > > 1. Look at the portion that is OK (did not enter a reject state in the > DFA) > 2. Generate the next possible String and seek to that. > the Query simply wraps the filter with ConstantScoreQuery. > I did not include the automaton.jar inside the patch but it can be downloaded > from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads
[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782035#action_12782035 ] Michael McCandless commented on LUCENE-2075: {quote} bq. I am quite sure that also Robert's test is random (as he explained). It's not random - it's the specified pattern, parsed to WildcardQuery, run 10 times, then take best or avg time. {quote} Whoops -- I was wrong here -- Robert's test is random: on each iteration, it replaces any N's in the pattern w/ a random number 0-9. Still baffled as to why the linear scan shows gains w/ the cache... digging. > Share the Term -> TermInfo cache across threads > --- > > Key: LUCENE-2075 > URL: https://issues.apache.org/jira/browse/LUCENE-2075 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 3.1 > > Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, > LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, > LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, > LUCENE-2075.patch > > > Right now each thread creates its own (thread-private) SimpleLRUCache, > holding up to 1024 terms. > This is rather wasteful, since if there is a high number of threads > that come through Lucene, you're multiplying the RAM usage. You're > also cutting way back on the likelihood of a cache hit (except the known > multiple times we look up a term within-query, which uses one thread). > In NRT search we open new SegmentReaders (on tiny segments) often, > which each thread must then spend CPU/RAM creating & populating. > Now that we are on 1.5 we can use java.util.concurrent.*, e.g. > ConcurrentHashMap. One simple approach could be a double-barrel LRU > cache, using 2 maps (primary, secondary). You check the cache by > first checking primary; if that's a miss, you check secondary, and if > you get a hit you promote it to primary. Once primary is full you > clear secondary and swap them. > Or... any other suggested approach? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
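The double-barrel idea in the issue description is simple enough to sketch. The following is an illustrative version, not the attached ConcurrentLRUCache.java; key/value types and the synchronized put are assumptions made for brevity:

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  // Two maps; hits in secondary are promoted to primary. When primary
  // fills up, clear secondary and swap the two, so recently used entries
  // survive one more generation. Reads are lock-free; a concurrent swap
  // can at worst cause a spurious miss, which is fine for a cache.
  class DoubleBarrelLRUCacheSketch<K, V> {
    private final int maxSize;
    private volatile Map<K, V> primary = new ConcurrentHashMap<K, V>();
    private volatile Map<K, V> secondary = new ConcurrentHashMap<K, V>();

    DoubleBarrelLRUCacheSketch(int maxSize) { this.maxSize = maxSize; }

    public V get(K key) {
      V v = primary.get(key);
      if (v == null) {
        v = secondary.get(key);
        if (v != null) {
          put(key, v);            // promote the hit into primary
        }
      }
      return v;
    }

    public synchronized void put(K key, V value) {
      if (primary.size() >= maxSize) {
        secondary.clear();        // drop the older barrel...
        Map<K, V> tmp = primary;  // ...and swap: old primary becomes secondary
        primary = secondary;
        secondary = tmp;
      }
      primary.put(key, value);
    }
  }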
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782036#action_12782036 ] Robert Muir commented on LUCENE-1606: - Yeah, but in general I think I already agree that FilteredTerm*s*Enum is easier for stuff like this. Either way, it's still tricky to make enums like this, so I am glad you are looking into it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads
[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782039#action_12782039 ] Robert Muir commented on LUCENE-2075: - bq. Woops - I was wrong here - Robert's test is random: on each iteration, it replaces any N's in the pattern w/ a random number 0-9. Yeah, the terms are equally distributed 000-999 though, just a "fill". The wildcard patterns themselves are filled with random numbers. This is my basis for the new wildcard test, btw, except with maybe 1-10k terms; I definitely want over 8192 :) unless you have better ideas? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
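A sketch of the kind of test setup being described: the term dictionary is a uniform zero-padded "fill", and each benchmark iteration replaces the N's in a wildcard template with random digits. This is illustrative only; BenchWildcard.java may differ in the details:

  import java.util.Random;

  class WildcardBenchSketch {
    // Index terms "000" .. "999": a uniform fill of the term dictionary.
    static String[] makeTerms() {
      String[] terms = new String[1000];
      for (int i = 0; i < 1000; i++) {
        terms[i] = String.format("%03d", i);
      }
      return terms;
    }

    // Per iteration, replace each 'N' in the wildcard template with a
    // random digit, e.g. "?NN" might become "?47".
    static String fillPattern(String template, Random r) {
      StringBuilder sb = new StringBuilder(template.length());
      for (int i = 0; i < template.length(); i++) {
        char c = template.charAt(i);
        sb.append(c == 'N' ? (char) ('0' + r.nextInt(10)) : c);
      }
      return sb.toString();
    }
  }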
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782042#action_12782042 ] Uwe Schindler commented on LUCENE-1606: --- I think the approach with nextEnum() would work for Automaton and NRQ, because both use this iteration approach. You have nextString() for repositioning, and I have a LinkedList (a stack) of pre-sorted range bounds. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
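Sketched out, the shared skeleton being hinted at could look roughly like the following, where only the source of the next seek target differs per query type. Every name here is invented for illustration; this is not the flex API:

  import java.util.LinkedList;

  // Both queries seek to a target, read matching terms sequentially,
  // then ask for the next seek target. The NRQ flavor shown here pops
  // pre-sorted bounds off a LinkedList; AutomatonQuery would derive the
  // target from the DFA (its nextString()) instead.
  abstract class SeekingEnumSketch {
    // [lowerBound, upperBound] pairs, pre-computed and pre-sorted
    private final LinkedList<String[]> ranges = new LinkedList<String[]>();

    String nextSeekTarget() {
      return ranges.isEmpty() ? null : ranges.removeFirst()[0];
    }

    void enumerate() {
      String target;
      while ((target = nextSeekTarget()) != null) { // null = enumeration done
        seekTo(target);
        while (currentTermMatches()) {
          collectCurrentTerm();
          if (!advanceSequentially()) return;       // term dictionary exhausted
        }
      }
    }

    abstract void seekTo(String target);
    abstract boolean currentTermMatches();
    abstract void collectCurrentTerm();
    abstract boolean advanceSequentially();
  }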
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782043#action_12782043 ] Robert Muir commented on LUCENE-1606: - And I could still use this with "dumb mode", just one enum, right? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Hudson Account for me refused
Hi all, I was trying to get an account for Hudson, but it was refused: https://issues.apache.org/jira/browse/INFRA-2326 As far as I know, other committers of Lucene-Java who are not PMC members already have one, so what should I do? It would help if somebody with a Hudson account could at least change the build properties to use version 3.1-dev. The nightly svn target was already changed. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads
[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782046#action_12782046 ] Michael McCandless commented on LUCENE-2075: bq. This is my basis for the new wildcard test btw, except maybe 1-10k, definitely want over 8192 Sounds great :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782056#action_12782056 ] Uwe Schindler commented on LUCENE-1606: --- Yes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)
Hi Uwe,

On Sun, 22 Nov 2009, Uwe Schindler wrote:
> I have built the artifacts for the final release of "Apache Lucene Java 3.0.0" a second time, because of a bug in the TokenStream API (found by Shai Erera, who wanted to make "bad" things with addAttribute, breaking its behaviour, LUCENE-2088) and an improvement in NumericRangeQuery (to prevent a stack overflow, LUCENE-2087). They are targeted for release on 2009-11-25. The artifacts are here: http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-take2/

The artifacts you've prepared don't correspond to the HEAD of the lucene_3_0 branch anymore since fixes for bugs 2086 and 2092 were added. Could you please add a lucene_3_0_0 tag that corresponds to the artifacts? This makes it easier to build a PyLucene with Lucene Java sources equivalent to these artifacts, using Lucene Java's svn. Of course, if another revision of these artifacts ends up being made, the tag should then move accordingly, but at this point it's just missing.

Thanks!

Andi..

> You find the changes in the corresponding sub folder. The SVN revision is 883080; here is the manifest with build system info:
> Manifest-Version: 1.0
> Ant-Version: Apache Ant 1.7.0
> Created-By: 1.5.0_22-b03 (Sun Microsystems Inc.)
> Specification-Title: Lucene Search Engine
> Specification-Version: 3.0.0
> Specification-Vendor: The Apache Software Foundation
> Implementation-Title: org.apache.lucene
> Implementation-Version: 3.0.0 883080 - 2009-11-22 15:52:49
> Implementation-Vendor: The Apache Software Foundation
> X-Compile-Source-JDK: 1.5
> X-Compile-Target-JDK: 1.5
> Please vote to officially release these artifacts as "Apache Lucene Java 3.0.0". We need at least 3 binding (PMC) votes. Thanks everyone for all their hard work on this, and I am very sorry for requesting a vote again, but that's life! Thanks Shai for the pointer to the bug!
> Here is the proposed release note; please edit if needed:
> --
> Hello Lucene users,
> On behalf of the Lucene dev community (a growing community far larger than just the committers) I would like to announce the release of Lucene Java 3.0:
> The new version is mostly a cleanup release without any new features. All deprecations targeted to be removed in version 3.0 were removed. If you are upgrading from version 2.9.1 of Lucene, you have to fix all deprecation warnings in your code base to be able to recompile against this version.
> This is the first Lucene release with Java 5 as a minimum requirement. The API was cleaned up to make use of Java 5's generics, varargs, enums, and autoboxing. New users of Lucene are advised to use this version for new developments, because it has a clean, type-safe new API. Upgrading users can now remove unnecessary casts and add generics to their code, too. If you have not upgraded your installation to Java 5, please read the file JRE_VERSION_MIGRATION.txt (please note that this is not related to Lucene 3.0; it will also happen with any previous release when you upgrade your Java environment).
> Lucene 3.0 has some changes regarding compressed fields: 2.9 already deprecated compressed fields, and support for them has now been removed. Lucene 3.0 is still able to read indexes with compressed fields, but as soon as merges occur or the index is optimized, all compressed fields are decompressed and converted to Field.Store.YES. Because of this, indexes with compressed fields can suddenly get larger.
> While we generally try to maintain full backwards compatibility between major versions, Lucene 3.0 has some minor breaks, mostly related to deprecation removal, pointed out in the 'Changes in backwards compatibility policy' section of CHANGES.txt. Notable are:
> - IndexReader.open(Directory) now opens in read-only mode by default (this method was deprecated because of that in 2.9). The same applies to IndexSearcher.
> - Already started in 2.9, core TokenStreams are now made final to enforce the decorator pattern.
> - If you interrupt an IndexWriter merge thread, IndexWriter now throws an unchecked ThreadInterruptedException that extends RuntimeException and clears the interrupt status.
> --
> Thanks,
> Uwe
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
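As an illustration of the cleanup the release note describes (this sketch is not part of the note itself; it assumes an already-opened Directory and a non-empty index): in 3.0, IndexReader.open(Directory) is read-only by default, and generified APIs such as Document.getFields() remove the raw-type casts that 2.x code needed.

  import java.util.List;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Fieldable;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.store.Directory;

  class UpgradeSketch {
    static void showUpgradedUsage(Directory dir) throws Exception {
      IndexReader reader = IndexReader.open(dir);   // read-only by default in 3.0
      try {
        Document doc = reader.document(0);
        List<Fieldable> fields = doc.getFields();   // was a raw List in 2.x, cast needed
        System.out.println(fields.size() + " fields in first doc");
      } finally {
        reader.close();
      }
    }
  }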
RE: Hudson Account for me refused
Here is the rationale for that: http://mail-archives.apache.org/mod_mbox/www-builds/200911.mbox/%3c4B0C1563.6050...@apache.org%3e - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Uwe Schindler [mailto:u...@thetaphi.de] > Sent: Tuesday, November 24, 2009 6:32 PM > To: java-dev@lucene.apache.org > Subject: Hudson Account for me refused > > Hi all, > > I was trying to get an account for Hudson, but it was refused: > https://issues.apache.org/jira/browse/INFRA-2326 > > As far as I know, other committers of Lucene-Java who are not PMC members already have one, so what should I do? > > It would help if somebody with a Hudson account could at least change the build properties to use version 3.1-dev. The nightly svn target was already changed. > > Uwe - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Hudson Account for me refused
Yeah, I've seen these rejections before - I don't think the rule makes any sense, but they only give Hudson accounts to PMC members. Uwe Schindler wrote: > Here is the rationale for that: > http://mail-archives.apache.org/mod_mbox/www-builds/200911.mbox/%3c4B0C1563.6050...@apache.org%3e > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de >> -Original Message- >> From: Uwe Schindler [mailto:u...@thetaphi.de] >> Sent: Tuesday, November 24, 2009 6:32 PM >> To: java-dev@lucene.apache.org >> Subject: Hudson Account for me refused >> >> Hi all, >> >> I was trying to get an account for Hudson, but it was refused: >> https://issues.apache.org/jira/browse/INFRA-2326 >> >> As far as I know, other committers of Lucene-Java who are not PMC members already have one, so what should I do? >> >> It would help if somebody with a Hudson account could at least change the build properties to use version 3.1-dev. The nightly svn target was already changed. >> >> Uwe - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)
Hi Andi, I will add the tag when the release is officially voted. If we respin, the tag would be incorrect (and would have to be removed and recreated). The release todo clearly says that the tag should be added once all votes are in, and all previous releases were done like this. Just one more day and I will create the tag (if I get 2 more votes). Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Andi Vajda [mailto:va...@osafoundation.org] > Sent: Tuesday, November 24, 2009 6:46 PM > To: java-dev@lucene.apache.org > Subject: Re: [VOTE] Release Apache Lucene Java 3.0.0 (take #2) > > Hi Uwe, > The artifacts you've prepared don't correspond to the HEAD of the lucene_3_0 branch anymore since fixes for bugs 2086 and 2092 were added. > Could you please add a lucene_3_0_0 tag that corresponds to the artifacts? This makes it easier to build a PyLucene with Lucene Java sources equivalent to these artifacts, using Lucene Java's svn. > Of course, if another revision of these artifacts ends up being made, the tag should then move accordingly but, at this point, it's just missing. > Thanks! > Andi.. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)
On Tue, 24 Nov 2009, Uwe Schindler wrote: > I will add the tag when the release is officially voted. If we respin, the tag would be incorrect (and would have to be removed and recreated). The release todo clearly says that the tag should be added once all votes are in, and all previous releases were done like this. Just one more day and I will create the tag (if I get 2 more votes). So I'm in a catch-22. I was going to vote if I could build a PyLucene from this and pass all PyLucene tests :) Do you happen to know what svn rev was used to build the artifacts? I could use that rev instead of HEAD. Andi.. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org