[jira] Commented: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order

2009-11-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781837#action_12781837
 ] 

Michael McCandless commented on LUCENE-2086:


{quote}
I didn't want to have commits in 3.0, because if I respin a release, I would 
not be able to take only some of the fixes into 3.0.0. That was the reason.

Can you put this also in 2.9.2 if you remove the generics?
{quote}
OK I'll backport...

> When resolving deletes, IW should resolve in term sort order
> 
>
> Key: LUCENE-2086
> URL: https://issues.apache.org/jira/browse/LUCENE-2086
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.9.2, 3.0.1, 3.1
>
> Attachments: LUCENE-2086.patch
>
>
> See java-dev thread "IndexWriter.updateDocument performance improvement".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Socket and file locks

2009-11-24 Thread Thomas Mueller
Hi,

> > > shouldn't active code like that live in the application layer?
> > Why?
> You can all but guarantee that polling will work at the app layer

The application layer may also run with low priority. In operating
systems, it's usually the lower layers that have more 'rights'
(priority), not the higher levels (I'm not saying it should be
like that in Java). I just think the application layer should not have
to deal with acquiring or removing write locks.

> by the time the original process realizes that it doesn't hold the lock 
> anymore, the damage could already have been done.

Yes, I'm not sure how best to avoid that (with any design). Asking the
application layer or the user whether the lock file can be removed is
probably more dangerous than Lucene doing the best it can on its own.

Standby / hibernate: the question is, if the process is currently not
running, does it still hold the lock? I think not, because the machine
might as well be turned off. How do you detect whether the machine is
turned off versus hibernating? I guess that's a problem for all
mechanisms (socket / file lock / background thread).

When a hibernated process wakes up again, it thinks it still owns the
lock. Even if the process checks before each write, it is unsafe:

if (isStillLocked()) {
  // unsafe: the process can be suspended right here, the lock broken
  // by another process, and write() still executes afterwards
  write();
}

The process could wake up after isStillLocked() but before write().
One protection is: the second process (the one that breaks the lock)
would need to work on a copy of the data instead of the original file
(it could delete / truncate the original file after creating the copy).
On Windows, renaming the file might work (I'm not sure); on Linux you
probably need to copy the content to a new file. That way, the awoken
process can only destroy inactive data.
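
A minimal sketch of that protection, assuming a single data file and
using java.nio.file for brevity (newer than the Java Lucene targeted at
the time); LockBreaker and the file names are made up for illustration:

import java.io.IOException;
import java.nio.file.*;

public class LockBreaker {
  // Break a stale lock by moving the live data aside first, so a
  // process that later wakes up can only write to the inactive old file.
  static Path breakLockAndCopy(Path dataFile, Path lockFile) throws IOException {
    Path copy = dataFile.resolveSibling(dataFile.getFileName() + ".new");
    // work on a copy of the data, never on the original
    Files.copy(dataFile, copy, StandardCopyOption.REPLACE_EXISTING);
    // truncate the original so an awoken lock holder cannot corrupt live data
    Files.write(dataFile, new byte[0],
        StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING);
    Files.deleteIfExists(lockFile);
    return copy; // continue with the copy as the live file
  }
}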

The question is: do we need to solve this problem? How big is the
risk? Instead of solving this problem completely, you could detect it
after the fact without much overhead, and throw an exception saying:
"data may be corrupt now".

PID: with the PID, you could check whether the process is still
running. But it could be another process with the same PID (PIDs do
get reused), or the same PID on a different machine (when using a
network share). It's probably safer if you can communicate with the
lock owner (using TCP/IP, or over the file system by deleting/creating
a file).
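
As an aside, on a modern JVM the liveness half of that check is simple
(ProcessHandle is Java 9+, so it did not exist at the time); the sketch
below also shows exactly the weakness just described, because a hit
only proves the PID is in use, not that it still belongs to the lock
owner:

import java.util.Optional;

public class PidCheck {
  // true if *some* process with this PID is alive on this machine
  static boolean pidLooksAlive(long pid) {
    Optional<ProcessHandle> handle = ProcessHandle.of(pid);
    // a hit does not prove it is still the process that wrote the lock file
    return handle.map(ProcessHandle::isAlive).orElse(false);
  }
}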

Unique id: the easiest solution is to use a UUID (a cryptographically
secure random number). That problem _is_ solved (some systems have
trouble generating entropy, but there are workarounds). If you have a
communication channel to the process anyway, you could ask it for this
UUID. Once you have a communication channel, you can do a lot
(reference counting, safely transferring the lock, ...).
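
A minimal sketch of a UUID-stamped lock file (the layout and the
re-check step are illustrative assumptions, not an existing Lucene
mechanism; note the check-then-write race discussed above still
applies):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.UUID;

public class UuidLockFile {
  private final Path lockFile;
  private final String myId = UUID.randomUUID().toString();

  UuidLockFile(Path lockFile) { this.lockFile = lockFile; }

  // acquire by creating the lock file atomically; fails if it already exists
  void acquire() throws IOException {
    Files.write(lockFile, myId.getBytes(StandardCharsets.UTF_8),
        StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
  }

  // re-read the file: if another process replaced it, we lost the lock
  boolean stillOwned() throws IOException {
    return Files.exists(lockFile)
        && myId.equals(new String(Files.readAllBytes(lockFile), StandardCharsets.UTF_8));
  }
}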

> If the server and the client can't access each other

How do you find out that the server is still running? My point is: I'd
like to have a secure, automatic way to break the lock if the machine
or process is stopped. And in my experience, native file locking is
problematic for this.

You could also combine solutions (for example, the 'open a server
socket' solution with the 'background thread' solution). I'm not sure
whether the 'hibernate' problem is worth solving.
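
To illustrate the socket variant (port number and scheme are
assumptions, not an existing Lucene mechanism): binding a localhost
server socket acts as a machine-wide lock that the OS releases
automatically when the process dies, which gives the automatic break
asked for above, though it shares the hibernate caveat:

import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

public class SocketLock {
  private ServerSocket socket;

  // only one process per machine can bind the port at a time
  boolean tryLock(int port) {
    try {
      // loopback only, backlog 1; the OS frees the port on process death
      socket = new ServerSocket(port, 1, InetAddress.getByName("127.0.0.1"));
      return true;
    } catch (IOException portTaken) {
      return false; // another live process holds the lock
    }
  }

  void unlock() throws IOException {
    if (socket != null) socket.close();
  }
}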

Regards,
Thomas

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)

2009-11-24 Thread Michael McCandless
As DM Smith said, since the bug is longstanding and we are only now
just hearing about it, it appears not to be that severe in practice.
I guess users don't often mix coord enabled & disabled BQs, that are
otherwise identical, in the same cache.

So I think we ship 3.0.0 anyways?

Mike

On Tue, Nov 24, 2009 at 2:26 AM, Uwe Schindler  wrote:
> Hi all,
>
> Hoss reported a bug about two fields missing in the equals/hashCode of
> BooleanQuery (which has existed since 1.9,
> https://issues.apache.org/jira/browse/LUCENE-2092). Should I respin 3.0
> because of this, or just release it? Speak out loud if you want to respin
> (otherwise, vote)!
>
> We will apply the bugfix at least to 2.9.2 and 3.0.1
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>> -Original Message-
>> From: Uwe Schindler [mailto:u...@thetaphi.de]
>> Sent: Sunday, November 22, 2009 4:07 PM
>> To: gene...@lucene.apache.org; java-dev@lucene.apache.org
>> Subject: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)
>>
>> Hi,
>>
>> I have built the artifacts for the final release of "Apache Lucene Java
>> 3.0.0" a second time, because of a bug in the TokenStream API (found by
>> Shai Erera, who wanted to do "bad" things with addAttribute, breaking its
>> behaviour, LUCENE-2088) and an improvement in NumericRangeQuery (to
>> prevent stack overflow, LUCENE-2087). They are targeted for release on
>> 2009-11-25.
>>
>> The artifacts are here:
>> http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-take2/
>>
>> You can find the changes in the corresponding subfolder. The SVN revision
>> is 883080; here is the manifest with build system info:
>>
>> Manifest-Version: 1.0
>> Ant-Version: Apache Ant 1.7.0
>> Created-By: 1.5.0_22-b03 (Sun Microsystems Inc.)
>> Specification-Title: Lucene Search Engine
>> Specification-Version: 3.0.0
>> Specification-Vendor: The Apache Software Foundation
>> Implementation-Title: org.apache.lucene
>> Implementation-Version: 3.0.0 883080 - 2009-11-22 15:52:49
>> Implementation-Vendor: The Apache Software Foundation
>> X-Compile-Source-JDK: 1.5
>> X-Compile-Target-JDK: 1.5
>>
>> Please vote to officially release these artifacts as "Apache Lucene Java
>> 3.0.0".
>>
>> We need at least 3 binding (PMC) votes.
>>
>> Thanks everyone for all their hard work on this and I am very sorry for
>> requesting a vote again, but that's life! Thanks Shai for the pointer to
>> the
>> bug!
>>
>>
>>
>>
>> Here is the proposed release note, please edit, if needed:
>> --
>>
>> Hello Lucene users,
>>
>> On behalf of the Lucene dev community (a growing community far larger than
>> just the committers) I would like to announce the release of Lucene Java
>> 3.0:
>>
>> The new version is mostly a cleanup release without any new features. All
>> deprecations targeted to be removed in version 3.0 were removed. If you
>> are
>> upgrading from version 2.9.1 of Lucene, you have to fix all deprecation
>> warnings in your code base to be able to recompile against this version.
>>
>> This is the first Lucene release with Java 5 as a minimum requirement. The
>> API was cleaned up to make use of Java 5's generics, varargs, enums, and
>> autoboxing. New users of Lucene are advised to use this version for new
>> developments, because it has a clean, type-safe new API. Upgrading users
>> can now remove unnecessary casts and add generics to their code, too. If
>> you have not yet upgraded your installation to Java 5, please read the
>> file JRE_VERSION_MIGRATION.txt (note that this is not specific to Lucene
>> 3.0; it also applies with any previous release when you upgrade your Java
>> environment).
>>
>> Lucene 3.0 has some changes regarding compressed fields: 2.9 already
>> deprecated compressed fields, and support for them has now been removed.
>> Lucene 3.0 is still able to read indexes with compressed fields, but as
>> soon as merges occur or the index is optimized, all compressed fields are
>> decompressed and converted to Field.Store.YES. Because of this, indexes
>> with compressed fields can suddenly get larger.
>>
>> While we generally try to maintain full backwards compatibility between
>> major versions, Lucene 3.0 has some minor breaks, mostly related to
>> deprecation removal, as pointed out in the 'Changes in backwards
>> compatibility policy' section of CHANGES.txt. Notable are:
>>
>> - IndexReader.open(Directory) now opens in read-only mode by default
>> (the method was deprecated in 2.9 because of this). The same applies to
>> IndexSearcher.
>>
>> - Continuing what was started in 2.9, core TokenStreams are now made
>> final, to enforce the decorator pattern.
>>
>> - If you interrupt an IndexWriter merge thread, IndexWriter now throws an
>> unchecked ThreadInterruptedException that extends RuntimeException and
>> clears the interrupt status.
>>
>> -

RE: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)

2009-11-24 Thread Uwe Schindler
> As DM Smith said, since the bug is longstanding and we are only now
> just hearing about it, it appears not to be that severe in practice.
> I guess users don't often mix coord enabled & disabled BQs, that are
> otherwise identical, in the same cache.

DM Smith also wanted this in 2.9.2, which I think is fine. The fix is so
simple that we could just merge it to the 2.9 branch. And Erick Erickson
also noted that this bug is longstanding.

> So I think we ship 3.0.0 anyways?

+1, I just wanted to ask. Now votes are required; I have zero counting
votes so far.

Uwe



> On Tue, Nov 24, 2009 at 2:26 AM, Uwe Schindler  wrote:
> > Hi all,
> >
> > Hoss reported a bug about two fields missing in the equals/hashCode of
> > BooleanQuery (which has existed since 1.9,
> > https://issues.apache.org/jira/browse/LUCENE-2092). Should I respin 3.0
> > because of this, or just release it? Speak out loud if you want to
> > respin (otherwise, vote)!
> >
> > We will apply the bugfix at least to 2.9.2 and 3.0.1
> >
> > Uwe
> >
> > -
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> >> -Original Message-
> >> From: Uwe Schindler [mailto:u...@thetaphi.de]
> >> Sent: Sunday, November 22, 2009 4:07 PM
> >> To: gene...@lucene.apache.org; java-dev@lucene.apache.org
> >> Subject: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)
> >>
> >> Hi,
> >>
> >> I have built the artifacts for the final release of "Apache Lucene Java
> >> 3.0.0" a second time, because of a bug in the TokenStream API (found by
> >> Shai
> >> Erera, who wanted to make "bad" things with addAttribute, breaking its
> >> behaviour, LUCENE-2088) and an improvement in NumericRangeQuery (to
> >> prevent
> >> stack overflow, LUCENE-2087). They are targeted for release on 2009-11-
> 25.
> >>
> >> The artifacts are here:
> >> http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-take2/
> >>
> >> You find the changes in the corresponding sub folder. The SVN revision
> is
> >> 883080, here the manifest with build system info:
> >>
> >> Manifest-Version: 1.0
> >> Ant-Version: Apache Ant 1.7.0
> >> Created-By: 1.5.0_22-b03 (Sun Microsystems Inc.)
> >> Specification-Title: Lucene Search Engine
> >> Specification-Version: 3.0.0
> >> Specification-Vendor: The Apache Software Foundation
> >> Implementation-Title: org.apache.lucene
> >> Implementation-Version: 3.0.0 883080 - 2009-11-22 15:52:49
> >> Implementation-Vendor: The Apache Software Foundation
> >> X-Compile-Source-JDK: 1.5
> >> X-Compile-Target-JDK: 1.5
> >>
> >> Please vote to officially release these artifacts as "Apache Lucene
> Java
> >> 3.0.0".
> >>
> >> We need at least 3 binding (PMC) votes.
> >>
> >> Thanks everyone for all their hard work on this and I am very sorry for
> >> requesting a vote again, but that's life! Thanks Shai for the pointer
> to
> >> the
> >> bug!
> >>
> >>
> >>
> >>
> >> Here is the proposed release note, please edit, if needed:
> >> ---
> ---
> >>
> >> Hello Lucene users,
> >>
> >> On behalf of the Lucene dev community (a growing community far larger
> than
> >> just the committers) I would like to announce the release of Lucene
> Java
> >> 3.0:
> >>
> >> The new version is mostly a cleanup release without any new features.
> All
> >> deprecations targeted to be removed in version 3.0 were removed. If you
> >> are
> >> upgrading from version 2.9.1 of Lucene, you have to fix all deprecation
> >> warnings in your code base to be able to recompile against this
> version.
> >>
> >> This is the first Lucene release with Java 5 as a minimum requirement.
> The
> >> API was cleaned up to make use of Java 5's generics, varargs, enums,
> and
> >> autoboxing. New users of Lucene are advised to use this version for new
> >> developments, because it has a clean, type safe new API. Upgrading
> users
> >> can
> >> now remove unnecessary casts and add generics to their code, too. If
> you
> >> have not upgraded your installation to Java 5, please read the file
> >> JRE_VERSION_MIGRATION.txt (please note that this is not related to
> Lucene
> >> 3.0, it will also happen with any previous release when you upgrade
> your
> >> Java environment).
> >>
> >> Lucene 3.0 has some changes regarding compressed fields: 2.9 already
> >> deprecated compressed fields; support for them was removed now. Lucene
> 3.0
> >> is still able to read indexes with compressed fields, but as soon as
> >> merges
> >> occur or the index is optimized, all compressed fields are decompressed
> >> and
> >> converted to Field.Store.YES. Because of this, indexes with compressed
> >> fields can suddenly get larger.
> >>
> >> While we generally try and maintain full backwards compatibility
> between
> >> major versions, Lucene 3.0 has some minor breaks, mostly related to
> >> deprecation removal, pointed out in the 'Changes in backwards
> >> compatibility
> >> policy' section of CHANGES.txt. Notabl

Re: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)

2009-11-24 Thread Simon Willnauer
On Tue, Nov 24, 2009 at 11:09 AM, Uwe Schindler  wrote:
>> As DM Smith said, since the bug is longstanding and we are only now
>> just hearing about it, it appears not to be that severe in practice.
>> I guess users don't often mix coord enabled & disabled BQs, that are
>> otherwise identical, in the same cache.
>
> DM Smith also wanted this in 2.9.2, which I think is fine. The fix is so
> simple that we could just merge it to the 2.9 branch. And Erick Erickson
> also noted that this bug is longstanding.
>
>> So I think we ship 3.0.0 anyways?
>
> +1, I just wanted to ask. Now votes are required; I have zero counting
> votes so far.
+1 for not respinning 3.0 because of this bug; I also agree with the
statements above!
+1 for 3.0, even though I'm not a PMC member :)

simon
>
> Uwe
>
>
>
>> On Tue, Nov 24, 2009 at 2:26 AM, Uwe Schindler  wrote:
>> > Hi all,
>> >
>> > Hoss reported a bug about two fields missing in the equals/hashCode of
>> > BooleanQuery (which has existed since 1.9,
>> > https://issues.apache.org/jira/browse/LUCENE-2092). Should I respin 3.0
>> > because of this, or just release it? Speak out loud if you want to
>> > respin (otherwise, vote)!
>> >
>> > We will apply the bugfix at least to 2.9.2 and 3.0.1
>> >
>> > Uwe
>> >
>> > -
>> > Uwe Schindler
>> > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > http://www.thetaphi.de
>> > eMail: u...@thetaphi.de
>> >
>> >> -Original Message-
>> >> From: Uwe Schindler [mailto:u...@thetaphi.de]
>> >> Sent: Sunday, November 22, 2009 4:07 PM
>> >> To: gene...@lucene.apache.org; java-dev@lucene.apache.org
>> >> Subject: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)
>> >>
>> >> Hi,
>> >>
>> >> I have built the artifacts for the final release of "Apache Lucene Java
>> >> 3.0.0" a second time, because of a bug in the TokenStream API (found by
>> >> Shai
>> >> Erera, who wanted to make "bad" things with addAttribute, breaking its
>> >> behaviour, LUCENE-2088) and an improvement in NumericRangeQuery (to
>> >> prevent
>> >> stack overflow, LUCENE-2087). They are targeted for release on 2009-11-
>> 25.
>> >>
>> >> The artifacts are here:
>> >> http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-take2/
>> >>
>> >> You find the changes in the corresponding sub folder. The SVN revision
>> is
>> >> 883080, here the manifest with build system info:
>> >>
>> >> Manifest-Version: 1.0
>> >> Ant-Version: Apache Ant 1.7.0
>> >> Created-By: 1.5.0_22-b03 (Sun Microsystems Inc.)
>> >> Specification-Title: Lucene Search Engine
>> >> Specification-Version: 3.0.0
>> >> Specification-Vendor: The Apache Software Foundation
>> >> Implementation-Title: org.apache.lucene
>> >> Implementation-Version: 3.0.0 883080 - 2009-11-22 15:52:49
>> >> Implementation-Vendor: The Apache Software Foundation
>> >> X-Compile-Source-JDK: 1.5
>> >> X-Compile-Target-JDK: 1.5
>> >>
>> >> Please vote to officially release these artifacts as "Apache Lucene
>> Java
>> >> 3.0.0".
>> >>
>> >> We need at least 3 binding (PMC) votes.
>> >>
>> >> Thanks everyone for all their hard work on this and I am very sorry for
>> >> requesting a vote again, but that's life! Thanks Shai for the pointer
>> to
>> >> the
>> >> bug!
>> >>
>> >>
>> >>
>> >>
>> >> Here is the proposed release note, please edit, if needed:
>> >> ---
>> ---
>> >>
>> >> Hello Lucene users,
>> >>
>> >> On behalf of the Lucene dev community (a growing community far larger
>> than
>> >> just the committers) I would like to announce the release of Lucene
>> Java
>> >> 3.0:
>> >>
>> >> The new version is mostly a cleanup release without any new features.
>> All
>> >> deprecations targeted to be removed in version 3.0 were removed. If you
>> >> are
>> >> upgrading from version 2.9.1 of Lucene, you have to fix all deprecation
>> >> warnings in your code base to be able to recompile against this
>> version.
>> >>
>> >> This is the first Lucene release with Java 5 as a minimum requirement.
>> The
>> >> API was cleaned up to make use of Java 5's generics, varargs, enums,
>> and
>> >> autoboxing. New users of Lucene are advised to use this version for new
>> >> developments, because it has a clean, type safe new API. Upgrading
>> users
>> >> can
>> >> now remove unnecessary casts and add generics to their code, too. If
>> you
>> >> have not upgraded your installation to Java 5, please read the file
>> >> JRE_VERSION_MIGRATION.txt (please note that this is not related to
>> Lucene
>> >> 3.0, it will also happen with any previous release when you upgrade
>> your
>> >> Java environment).
>> >>
>> >> Lucene 3.0 has some changes regarding compressed fields: 2.9 already
>> >> deprecated compressed fields; support for them was removed now. Lucene
>> 3.0
>> >> is still able to read indexes with compressed fields, but as soon as
>> >> merges
>> >> occur or the index is optimized, all compressed fields are decompressed
>> >> and
>> >> converted to Field.Store.YES. Because of this, indexes with com

Re: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)

2009-11-24 Thread Michael McCandless
+1 to release the current artifacts as 3.0.0!

Mike

On Tue, Nov 24, 2009 at 5:11 AM, Simon Willnauer
 wrote:
> On Tue, Nov 24, 2009 at 11:09 AM, Uwe Schindler  wrote:
>>> As DM Smith said, since the bug is longstanding and we are only now
>>> just hearing about it, it appears not to be that severe in practice.
>>> I guess users don't often mix coord enabled & disabled BQs, that are
>>> otherwise identical, in the same cache.
>>
>> DM Smith also wanted this in 2.9.2, which I think is fine. The fix is so
>> simple that we could just merge it to the 2.9 branch. And Erick Erickson
>> also noted that this bug is longstanding.
>>
>>> So I think we ship 3.0.0 anyways?
>>
>> +1, I just wanted to ask. Now votes are required; I have zero counting
>> votes so far.
> +1 for not respinning 3.0 because of this bug; I also agree with the
> statements above!
> +1 for 3.0, even though I'm not a PMC member :)
>
> simon
>>
>> Uwe
>>
>>
>>
>>> On Tue, Nov 24, 2009 at 2:26 AM, Uwe Schindler  wrote:
>>> > Hi all,
>>> >
>>> > Hoss reported a bug about two fields missing in the equals/hashCode of
>>> > BooleanQuery (which has existed since 1.9,
>>> > https://issues.apache.org/jira/browse/LUCENE-2092). Should I respin 3.0
>>> > because of this, or just release it? Speak out loud if you want to
>>> > respin (otherwise, vote)!
>>> >
>>> > We will apply the bugfix at least to 2.9.2 and 3.0.1
>>> >
>>> > Uwe
>>> >
>>> > -
>>> > Uwe Schindler
>>> > H.-H.-Meier-Allee 63, D-28213 Bremen
>>> > http://www.thetaphi.de
>>> > eMail: u...@thetaphi.de
>>> >
>>> >> -Original Message-
>>> >> From: Uwe Schindler [mailto:u...@thetaphi.de]
>>> >> Sent: Sunday, November 22, 2009 4:07 PM
>>> >> To: gene...@lucene.apache.org; java-dev@lucene.apache.org
>>> >> Subject: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)
>>> >>
>>> >> Hi,
>>> >>
>>> >> I have built the artifacts for the final release of "Apache Lucene Java
>>> >> 3.0.0" a second time, because of a bug in the TokenStream API (found by
>>> >> Shai
>>> >> Erera, who wanted to make "bad" things with addAttribute, breaking its
>>> >> behaviour, LUCENE-2088) and an improvement in NumericRangeQuery (to
>>> >> prevent
>>> >> stack overflow, LUCENE-2087). They are targeted for release on 2009-11-
>>> 25.
>>> >>
>>> >> The artifacts are here:
>>> >> http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-take2/
>>> >>
>>> >> You find the changes in the corresponding sub folder. The SVN revision
>>> is
>>> >> 883080, here the manifest with build system info:
>>> >>
>>> >> Manifest-Version: 1.0
>>> >> Ant-Version: Apache Ant 1.7.0
>>> >> Created-By: 1.5.0_22-b03 (Sun Microsystems Inc.)
>>> >> Specification-Title: Lucene Search Engine
>>> >> Specification-Version: 3.0.0
>>> >> Specification-Vendor: The Apache Software Foundation
>>> >> Implementation-Title: org.apache.lucene
>>> >> Implementation-Version: 3.0.0 883080 - 2009-11-22 15:52:49
>>> >> Implementation-Vendor: The Apache Software Foundation
>>> >> X-Compile-Source-JDK: 1.5
>>> >> X-Compile-Target-JDK: 1.5
>>> >>
>>> >> Please vote to officially release these artifacts as "Apache Lucene
>>> Java
>>> >> 3.0.0".
>>> >>
>>> >> We need at least 3 binding (PMC) votes.
>>> >>
>>> >> Thanks everyone for all their hard work on this and I am very sorry for
>>> >> requesting a vote again, but that's life! Thanks Shai for the pointer
>>> to
>>> >> the
>>> >> bug!
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> Here is the proposed release note, please edit, if needed:
>>> >> ---
>>> ---
>>> >>
>>> >> Hello Lucene users,
>>> >>
>>> >> On behalf of the Lucene dev community (a growing community far larger
>>> than
>>> >> just the committers) I would like to announce the release of Lucene
>>> Java
>>> >> 3.0:
>>> >>
>>> >> The new version is mostly a cleanup release without any new features.
>>> All
>>> >> deprecations targeted to be removed in version 3.0 were removed. If you
>>> >> are
>>> >> upgrading from version 2.9.1 of Lucene, you have to fix all deprecation
>>> >> warnings in your code base to be able to recompile against this
>>> version.
>>> >>
>>> >> This is the first Lucene release with Java 5 as a minimum requirement.
>>> The
>>> >> API was cleaned up to make use of Java 5's generics, varargs, enums,
>>> and
>>> >> autoboxing. New users of Lucene are advised to use this version for new
>>> >> developments, because it has a clean, type safe new API. Upgrading
>>> users
>>> >> can
>>> >> now remove unnecessary casts and add generics to their code, too. If
>>> you
>>> >> have not upgraded your installation to Java 5, please read the file
>>> >> JRE_VERSION_MIGRATION.txt (please note that this is not related to
>>> Lucene
>>> >> 3.0, it will also happen with any previous release when you upgrade
>>> your
>>> >> Java environment).
>>> >>
>>> >> Lucene 3.0 has some changes regarding compressed fields: 2.9 already
>>> >> deprecated compressed fields; support for them was removed now. Lucene
>>>

[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

2009-11-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781855#action_12781855
 ] 

Uwe Schindler commented on LUCENE-2075:
---

Just one question: the cache is initialized with max 1024 entries. Why
that number? If we share the cache between multiple threads, maybe we
should raise the max size, or make it configurable?

The entries in the cache are not very costly, so why not use 8192 or
16384? MTQs would be happy with that.

> Share the Term -> TermInfo cache across threads
> ---
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage.  You're
> also cutting way back on likelihood of a cache hit (except the known
> multiple times we lookup a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often
> which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap.  One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary).  You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary.  Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?
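
A minimal illustration of that double-barrel idea (a sketch only,
ignoring some benign races for brevity; this is not the attached
patch):

{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class DoubleBarrelCache<K, V> {
  private final int maxSize;
  private final AtomicInteger countdown;
  private volatile ConcurrentHashMap<K, V> primary = new ConcurrentHashMap<K, V>();
  private volatile ConcurrentHashMap<K, V> secondary = new ConcurrentHashMap<K, V>();

  public DoubleBarrelCache(int maxSize) {
    this.maxSize = maxSize;
    this.countdown = new AtomicInteger(maxSize);
  }

  public V get(K key) {
    V v = primary.get(key);
    if (v == null) {
      v = secondary.get(key);
      if (v != null) {
        put(key, v);  // promote a secondary hit back into primary
      }
    }
    return v;
  }

  public void put(K key, V value) {
    if (countdown.decrementAndGet() < 0) {
      synchronized (this) {  // primary is full: clear secondary, then swap
        if (countdown.get() < 0) {
          ConcurrentHashMap<K, V> old = secondary;
          secondary = primary;
          old.clear();
          primary = old;
          countdown.set(maxSize);
        }
      }
    }
    primary.put(key, value);
  }
}
{code}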

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-11-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781859#action_12781859
 ] 

Michael McCandless commented on LUCENE-1458:


{quote}
in trunk, things sort in UTF-16 binary order.
in branch, things sort in UTF-8 binary order.
these are different...
{quote}

Ugh!  In the back of my mind I almost remembered this... I think this
was one reason why I didn't do this back in LUCENE-843 (I think we had
discussed this already, then... though maybe I'm suffering from déjà
vu).  I could swear at one point I had that fixup logic implemented in
a UTF-8/16 comparison method...

UTF-8 sort order (what the flex branch has switched to) is true Unicode
codepoint sort order, while UTF-16 order is not, once terms mix
surrogate pairs with high (>= U+E000) Unicode chars.  Sigh.

So this is definitely a back compat problem.  And, unfortunately, even
if we like the true codepoint sort order, it's not easy to switch to
in a back-compat manner because if we write new segments into an old
index, SegmentMerger will be in big trouble when it tries to merge two
segments that had sorted the terms differently.

I would also prefer true codepoint sort order... but we can't break
back compat.

Though it would be nice to let the codec control the sort order -- eg
then (I think?) the ICU/CollationKeyFilter workaround wouldn't be
needed.

Fortunately the problem is isolated to how we sort the buffered
postings when it's time to flush a new segment, so I think w/ the
appropriate fixup logic (eg your comment at
https://issues.apache.org/jira/browse/LUCENE-1606?focusedCommentId=12781746&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12781746)
when comparing terms in oal.index.TermsHashPerField.comparePostings
during that sort, we can get back to UTF-16 sort order.
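
To make the discrepancy concrete, here is a small self-contained
illustration (my own example, not from the patch): String.compareTo
compares UTF-16 code units, so a BMP char at or above U+E000 sorts
after a surrogate pair even though its codepoint is smaller. The fixup
below is the standard remapping that simulates codepoint order while
comparing char by char (the branch would need the inverse mapping to
get back to UTF-16 order):

{code}
public class SortOrderDemo {
  // remap a UTF-16 code unit so that plain char comparison yields
  // codepoint (= UTF-8 byte) order: surrogates move above U+E000..U+FFFF
  static int fixup(char c) {
    if (c >= 0xD800) {
      return c < 0xE000 ? c + 0x2000   // surrogate: 0xD800..0xDFFF -> 0xF800..0xFFFF
                        : c - 0x0800;  // high BMP:  0xE000..0xFFFF -> 0xD800..0xF7FF
    }
    return c;
  }

  public static void main(String[] args) {
    String highBmp = "\uE000";                             // U+E000
    String supp = new String(Character.toChars(0x10400)); // U+10400, a surrogate pair

    // UTF-16 code unit order: 0xE000 > 0xD801 (the lead surrogate)
    System.out.println(highBmp.compareTo(supp) > 0);                       // true
    // codepoint (UTF-8) order: U+E000 < U+10400
    System.out.println(highBmp.codePointAt(0) < supp.codePointAt(0));      // true
    // after the fixup, char comparison matches codepoint order again
    System.out.println(fixup(highBmp.charAt(0)) < fixup(supp.charAt(0)));  // true
  }
}
{code}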


> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, 
> UnicodeTestCase.patch, UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file)

[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

2009-11-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781863#action_12781863
 ] 

Michael McCandless commented on LUCENE-2075:


Well, I just kept 1024 since that's what we currently do ;)

OK, I just did a rough tally -- I think we're looking at ~100 bytes (on
a 32-bit JRE) per entry, including CHM's HashEntry, the array in CHM,
TermInfoAndOrd, and the Term & its String text.

Not to mention DBLRU has a 2X multiplier at peak, so 200 bytes.

So at 1024 entries we're looking at ~200KB peak used by this cache, per
segment that is able to saturate the cache... so for a 20-segment index
you're at ~4MB of additional RAM consumed... so I don't think we should
increase this default.

Also, I don't think this cache is/should be attempting to achieve a
high hit rate *across* queries, only *within* a single query when that
query resolves the Term more than once.

I think caches that wrap more CPU, like Solr's query cache, are where
the app should aim for high hit rate.

Maybe we should even decrease the default size here -- what's
important is preventing in-flight queries from evicting one another's
cache entries.

For NRQ, 1024 is apparently already plenty big (relatively few seeks
occur).

For automaton query, which does lots of seeking, once the flex branch
lands there is no need for the cache (each lookup is done only once,
because the TermsEnum actualEnum is able to seek).  Before flex lands,
the cache is important, but only for automaton query, I think.

And honestly I'm still tempted to do away with this cache altogether
and create a "query scope", private to each query while it's running,
where terms dict (and other places that need to, over time) could
store stuff.  That'd give a perfect within-query hit rate and wouldn't
tie up any long term RAM...


> Share the Term -> TermInfo cache across threads
> ---
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage.  You're
> also cutting way back on likelihood of a cache hit (except the known
> multiple times we lookup a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often
> which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap.  One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary).  You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary.  Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

2009-11-24 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2034:


Attachment: LUCENE-2034,patch

Updated the patch to the current trunk.
I have not removed all the deprecated methods in contrib/analyzers yet -
we should open another issue for that, IMO.
This patch still breaks back compatibility, though, as some of the
non-final contrib analyzers extend StopawareAnalyzer, which makes the
old tokenStream / reusableTokenStream methods final. IMO this should not
block this issue, for the following reasons:
1. it's in contrib - a different story from core
2. it is super easy to port them
3. it makes the API cleaner with less code
4. those analyzers might have to change anyway due to the deprecated methods


simon

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2034,patch, LUCENE-2034.patch, LUCENE-2034.patch, 
> LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt
>
>
> Due to the various tokenStream APIs we had in Lucene, analyzer subclasses 
> need to implement at least one of the methods returning a tokenStream. When 
> you look at the code, it appears to be almost identical if both are 
> implemented in the same analyzer.  Each analyzer defines the same inner class 
> (SavedStreams), which is unnecessary.
> In contrib almost every analyzer uses stopwords, and each of them invents its 
> own way of loading them or defines a large number of ctors to load stopwords 
> from a file, set, array, etc. Those ctors should be deprecated and 
> eventually removed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

2009-11-24 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2034:


Attachment: LUCENE-2034,patch

set the svn EOL property (svn:eol-style) to native - missed that in the last patch

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, 
> LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt
>
>
> Due to the various tokenStream APIs we had in Lucene, analyzer subclasses 
> need to implement at least one of the methods returning a tokenStream. When 
> you look at the code, it appears to be almost identical if both are 
> implemented in the same analyzer.  Each analyzer defines the same inner class 
> (SavedStreams), which is unnecessary.
> In contrib almost every analyzer uses stopwords, and each of them invents its 
> own way of loading them or defines a large number of ctors to load stopwords 
> from a file, set, array, etc. Those ctors should be deprecated and 
> eventually removed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

2009-11-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781869#action_12781869
 ] 

Uwe Schindler commented on LUCENE-2075:
---

{quote}
And honestly I'm still tempted to do away with this cache altogether
and create a "query scope", private to each query while it's running,
where terms dict (and other places that need to, over time) could
store stuff. That'd give a perfect within-query hit rate and wouldn't
tie up any long term RAM...
{quote}

By "query scope" do you mean a whole query, not only an MTQ? If you combine 
multiple AutomatonQueries in a BooleanQuery, it could then also profit from 
the cache (as it does currently).

I think until flex, we should commit this and use the cache. When flex is out, 
we can think about doing this differently.

> Share the Term -> TermInfo cache across threads
> ---
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage.  You're
> also cutting way back on likelihood of a cache hit (except the known
> multiple times we lookup a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often
> which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap.  One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary).  You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary.  Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

2009-11-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781870#action_12781870
 ] 

Uwe Schindler commented on LUCENE-2034:
---

bq. set the svn EOL property (svn:eol-style) to native - missed that in the last patch
You can configure your SVN client to do it automatically and also add the 
$Id$ keyword props.
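
For reference, the client-side setting looks something like this in
~/.subversion/config (the standard auto-props mechanism; the exact file
patterns are up to you):

{code}
[miscellany]
enable-auto-props = yes

[auto-props]
*.java = svn:eol-style=native;svn:keywords=Id
*.txt = svn:eol-style=native
{code}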

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, 
> LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt
>
>
> Due to the various tokenStream APIs we had in Lucene, analyzer subclasses 
> need to implement at least one of the methods returning a tokenStream. When 
> you look at the code, it appears to be almost identical if both are 
> implemented in the same analyzer.  Each analyzer defines the same inner class 
> (SavedStreams), which is unnecessary.
> In contrib almost every analyzer uses stopwords, and each of them invents its 
> own way of loading them or defines a large number of ctors to load stopwords 
> from a file, set, array, etc. Those ctors should be deprecated and 
> eventually removed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781874#action_12781874
 ] 

Robert Muir commented on LUCENE-1458:
-

{quote}
Though it would be nice to let the codec control the sort order - eg
then (I think?) the ICU/CollationKeyFilter workaround wouldn't be
needed.
{quote}

I like this idea, by the way - "flexible sorting". Although I like codepoint 
order better than code unit order, I hate binary order in general, to be honest.

It's nice that we have 'indexable'/fast collation right now, but it's maybe not 
what users expect either (binary keys encoded into text).


> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, 
> UnicodeTestCase.patch, UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

2009-11-24 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir reassigned LUCENE-2034:
---

Assignee: Robert Muir

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, 
> LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt
>
>
> Due to the various tokenStream APIs we had in Lucene, analyzer subclasses 
> need to implement at least one of the methods returning a tokenStream. When 
> you look at the code, it appears to be almost identical if both are 
> implemented in the same analyzer.  Each analyzer defines the same inner class 
> (SavedStreams), which is unnecessary.
> In contrib almost every analyzer uses stopwords, and each of them invents its 
> own way of loading them or defines a large number of ctors to load stopwords 
> from a file, set, array, etc. Those ctors should be deprecated and 
> eventually removed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: svn commit: r883654 - /lucene/java/trunk/CHANGES.txt

2009-11-24 Thread Uwe Schindler
Do we need a new 3.0? (duck) - but it was only listed at the wrong position
in CHANGES.

We should also fix the 3.0 branch, for 3.0.1.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: mikemcc...@apache.org [mailto:mikemcc...@apache.org]
> Sent: Tuesday, November 24, 2009 12:20 PM
> To: java-comm...@lucene.apache.org
> Subject: svn commit: r883654 - /lucene/java/trunk/CHANGES.txt
> 
> Author: mikemccand
> Date: Tue Nov 24 11:19:43 2009
> New Revision: 883654
> 
> URL: http://svn.apache.org/viewvc?rev=883654&view=rev
> Log:
> LUCENE-2045: fix CHANGES entry (this was fixed in 2.9.2/3.0, not 2.9.1)
> 
> Modified:
> lucene/java/trunk/CHANGES.txt
> 
> Modified: lucene/java/trunk/CHANGES.txt
> URL:
> http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?rev=883654&r1=8
> 83653&r2=883654&view=diff
> ==
> 
> --- lucene/java/trunk/CHANGES.txt (original)
> +++ lucene/java/trunk/CHANGES.txt Tue Nov 24 11:19:43 2009
> @@ -188,6 +188,10 @@
>  * LUCENE-2088: addAttribute() should only accept interfaces that
>extend Attribute. (Shai Erera, Uwe Schindler)
> 
> +* LUCENE-2045: Fix silly FileNotFoundException hit if you enable
> +  infoStream on IndexWriter and then add an empty document and commit
> +  (Shai Erera via Mike McCandless)
> +
>  New features
> 
>  * LUCENE-1933: Provide a convenience AttributeFactory that creates a
> @@ -258,10 +262,6 @@
> char (U+FFFD) during indexing, to prevent silent index corruption.
> (Peter Keegan, Mike McCandless)
> 
> - * LUCENE-2045: Fix silly FileNotFoundException hit if you enable
> -   infoStream on IndexWriter and then add an empty document and commit
> -   (Shai Erera via Mike McCandless)
> -
>   * LUCENE-2046: IndexReader should not see the index as changed, after
> IndexWriter.prepareCommit has been called but before
> IndexWriter.commit is called. (Peter Keegan via Mike McCandless)
> 



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: svn commit: r883654 - /lucene/java/trunk/CHANGES.txt

2009-11-24 Thread Uwe Schindler
I have seen that we have the same problem with the next CHANGES entry, LUCENE-2046 :(

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Uwe Schindler [mailto:u...@thetaphi.de]
> Sent: Tuesday, November 24, 2009 12:26 PM
> To: java-dev@lucene.apache.org
> Subject: RE: svn commit: r883654 - /lucene/java/trunk/CHANGES.txt
> 
> Do we need a new 3.0? (duck) - but it was only listed at the wrong position
> in CHANGES.
> 
> We should also fix the 3.0 branch, for 3.0.1.
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
> 
> > -Original Message-
> > From: mikemcc...@apache.org [mailto:mikemcc...@apache.org]
> > Sent: Tuesday, November 24, 2009 12:20 PM
> > To: java-comm...@lucene.apache.org
> > Subject: svn commit: r883654 - /lucene/java/trunk/CHANGES.txt
> >
> > Author: mikemccand
> > Date: Tue Nov 24 11:19:43 2009
> > New Revision: 883654
> >
> > URL: http://svn.apache.org/viewvc?rev=883654&view=rev
> > Log:
> > LUCENE-2045: fix CHANGES entry (this was fixed in 2.9.2/3.0, not 2.9.1)
> >
> > Modified:
> > lucene/java/trunk/CHANGES.txt
> >
> > Modified: lucene/java/trunk/CHANGES.txt
> > URL:
> >
> http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?rev=883654&r1=8
> > 83653&r2=883654&view=diff
> >
> ==
> > 
> > --- lucene/java/trunk/CHANGES.txt (original)
> > +++ lucene/java/trunk/CHANGES.txt Tue Nov 24 11:19:43 2009
> > @@ -188,6 +188,10 @@
> >  * LUCENE-2088: addAttribute() should only accept interfaces that
> >extend Attribute. (Shai Erera, Uwe Schindler)
> >
> > +* LUCENE-2045: Fix silly FileNotFoundException hit if you enable
> > +  infoStream on IndexWriter and then add an empty document and commit
> > +  (Shai Erera via Mike McCandless)
> > +
> >  New features
> >
> >  * LUCENE-1933: Provide a convenience AttributeFactory that creates a
> > @@ -258,10 +262,6 @@
> > char (U+FFFD) during indexing, to prevent silent index corruption.
> > (Peter Keegan, Mike McCandless)
> >
> > - * LUCENE-2045: Fix silly FileNotFoundException hit if you enable
> > -   infoStream on IndexWriter and then add an empty document and commit
> > -   (Shai Erera via Mike McCandless)
> > -
> >   * LUCENE-2046: IndexReader should not see the index as changed, after
> > IndexWriter.prepareCommit has been called but before
> > IndexWriter.commit is called. (Peter Keegan via Mike McCandless)
> >
> 
> 
> 
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: svn commit: r883654 - /lucene/java/trunk/CHANGES.txt

2009-11-24 Thread Michael McCandless
OK looks like you fixed LUCENE-2046 as well, and ported both fixes to
3.0.x CHANGES.

I don't think this merits a 3.0.0 respin.

Though I wonder if there are other issues that got incorrectly moved into 2.9.1?

Mike

On Tue, Nov 24, 2009 at 6:26 AM, Uwe Schindler  wrote:
> Do we need a new 3.0? (duck) - but it was only listed at the wrong position
> in CHANGES.
>
> We should also fix the 3.0 branch, for 3.0.1.
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -Original Message-
>> From: mikemcc...@apache.org [mailto:mikemcc...@apache.org]
>> Sent: Tuesday, November 24, 2009 12:20 PM
>> To: java-comm...@lucene.apache.org
>> Subject: svn commit: r883654 - /lucene/java/trunk/CHANGES.txt
>>
>> Author: mikemccand
>> Date: Tue Nov 24 11:19:43 2009
>> New Revision: 883654
>>
>> URL: http://svn.apache.org/viewvc?rev=883654&view=rev
>> Log:
>> LUCENE-2045: fix CHANGES entry (this was fixed in 2.9.2/3.0, not 2.9.1)
>>
>> Modified:
>>     lucene/java/trunk/CHANGES.txt
>>
>> Modified: lucene/java/trunk/CHANGES.txt
>> URL:
>> http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?rev=883654&r1=8
>> 83653&r2=883654&view=diff
>> ==
>> 
>> --- lucene/java/trunk/CHANGES.txt (original)
>> +++ lucene/java/trunk/CHANGES.txt Tue Nov 24 11:19:43 2009
>> @@ -188,6 +188,10 @@
>>  * LUCENE-2088: addAttribute() should only accept interfaces that
>>    extend Attribute. (Shai Erera, Uwe Schindler)
>>
>> +* LUCENE-2045: Fix silly FileNotFoundException hit if you enable
>> +  infoStream on IndexWriter and then add an empty document and commit
>> +  (Shai Erera via Mike McCandless)
>> +
>>  New features
>>
>>  * LUCENE-1933: Provide a convenience AttributeFactory that creates a
>> @@ -258,10 +262,6 @@
>>     char (U+FFFD) during indexing, to prevent silent index corruption.
>>     (Peter Keegan, Mike McCandless)
>>
>> - * LUCENE-2045: Fix silly FileNotFoundException hit if you enable
>> -   infoStream on IndexWriter and then add an empty document and commit
>> -   (Shai Erera via Mike McCandless)
>> -
>>   * LUCENE-2046: IndexReader should not see the index as changed, after
>>     IndexWriter.prepareCommit has been called but before
>>     IndexWriter.commit is called. (Peter Keegan via Mike McCandless)
>>
>
>
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: svn commit: r883654 - /lucene/java/trunk/CHANGES.txt

2009-11-24 Thread Uwe Schindler
I looked through the 2.9.1 changes and found none that was too new.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Tuesday, November 24, 2009 12:39 PM
> To: java-dev@lucene.apache.org
> Subject: Re: svn commit: r883654 - /lucene/java/trunk/CHANGES.txt
> 
> OK looks like you fixed LUCENE-2046 as well, and ported both fixes to
> 3.0.x CHANGES.
> 
> I don't think this merits a 3.0.0 respin.
> 
> Though I wonder if there are other issues that got incorrectly moved into
> 2.9.1?
> 
> Mike
> 
> On Tue, Nov 24, 2009 at 6:26 AM, Uwe Schindler  wrote:
> > Do we need a new 3.0? (duck) - but it's fixed only at the wrong position in
> > the changes.
> >
> > But we should also fix the 3.0 branch for 3.0.1
> >
> > -
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> >
> >> -Original Message-
> >> From: mikemcc...@apache.org [mailto:mikemcc...@apache.org]
> >> Sent: Tuesday, November 24, 2009 12:20 PM
> >> To: java-comm...@lucene.apache.org
> >> Subject: svn commit: r883654 - /lucene/java/trunk/CHANGES.txt
> >>
> >> Author: mikemccand
> >> Date: Tue Nov 24 11:19:43 2009
> >> New Revision: 883654
> >>
> >> URL: http://svn.apache.org/viewvc?rev=883654&view=rev
> >> Log:
> >> LUCENE-2045: fix CHANGES entry (this was fixed in 2.9.2/3.0, not 2.9.1)
> >>
> >> Modified:
> >>     lucene/java/trunk/CHANGES.txt
> >>
> >> Modified: lucene/java/trunk/CHANGES.txt
> >> URL:
> >>
> http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?rev=883654&r1=8
> >> 83653&r2=883654&view=diff
> >>
> ==
> >> 
> >> --- lucene/java/trunk/CHANGES.txt (original)
> >> +++ lucene/java/trunk/CHANGES.txt Tue Nov 24 11:19:43 2009
> >> @@ -188,6 +188,10 @@
> >>  * LUCENE-2088: addAttribute() should only accept interfaces that
> >>    extend Attribute. (Shai Erera, Uwe Schindler)
> >>
> >> +* LUCENE-2045: Fix silly FileNotFoundException hit if you enable
> >> +  infoStream on IndexWriter and then add an empty document and commit
> >> +  (Shai Erera via Mike McCandless)
> >> +
> >>  New features
> >>
> >>  * LUCENE-1933: Provide a convenience AttributeFactory that creates a
> >> @@ -258,10 +262,6 @@
> >>     char (U+FFFD) during indexing, to prevent silent index corruption.
> >>     (Peter Keegan, Mike McCandless)
> >>
> >> - * LUCENE-2045: Fix silly FileNotFoundException hit if you enable
> >> -   infoStream on IndexWriter and then add an empty document and commit
> >> -   (Shai Erera via Mike McCandless)
> >> -
> >>   * LUCENE-2046: IndexReader should not see the index as changed, after
> >>     IndexWriter.prepareCommit has been called but before
> >>     IndexWriter.commit is called. (Peter Keegan via Mike McCandless)
> >>
> >
> >
> >
> > -
> > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >
> >
> 
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

2009-11-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781876#action_12781876
 ] 

Michael McCandless commented on LUCENE-2075:


bq. With Query Scope you mean a whole query, so not only a MTQ? If you combine 
multiple AutomatonQueries in a BooleanQuery it could also profit from the cache 
(as it is currently).

Right, I think the top level query would open up the scope... and free it once 
it's done running.

bq. I think until Flex, we should commit this and use the cache. When Flex is 
out, we may think of doing this different.

OK let's go with the shared cache for now, and revisit once flex lands.  I'll 
open a new issue...

But should we drop the cache to maybe 512?  Tying up 4 MB of RAM (with cache 
size 1024) for a "normal" index is kind of a lot...

> Share the Term -> TermInfo cache across threads
> ---
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage.  You're
> also cutting way back on likelihood of a cache hit (except the known
> multiple times we lookup a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often
> which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap.  One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary).  You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary.  Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
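
To make the double-barrel idea above concrete, here is a minimal sketch 
(assuming Java 1.5 generics and java.util.concurrent, per the issue text; 
DoubleBarrelLRUCache and its members are illustrative names, not the actual 
API of any patch):

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a double-barrel LRU cache: check primary, then secondary,
// promote secondary hits, and swap the barrels once primary fills up.
public class DoubleBarrelLRUCache<K,V> {
  private final int maxSize;
  private volatile Map<K,V> primary = new ConcurrentHashMap<K,V>();
  private volatile Map<K,V> secondary = new ConcurrentHashMap<K,V>();

  public DoubleBarrelLRUCache(int maxSize) {
    this.maxSize = maxSize;
  }

  public V get(K key) {
    V v = primary.get(key);
    if (v == null) {
      v = secondary.get(key);
      if (v != null) {
        primary.put(key, v); // promote the secondary hit to primary
      }
    }
    return v;
  }

  public synchronized void put(K key, V value) {
    primary.put(key, value);
    if (primary.size() >= maxSize) {
      // primary is full: clear secondary and swap the two maps
      secondary.clear();
      Map<K,V> tmp = primary;
      primary = secondary;
      secondary = tmp;
    }
  }
}
{code}

Recently-used entries survive one extra generation in secondary, which is what 
approximates LRU behavior without per-access bookkeeping; a production version 
would need more care around the swap under concurrency.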



[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

2009-11-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781878#action_12781878
 ] 

Michael McCandless commented on LUCENE-2075:


OK I opened LUCENE-2093 to track the "query private scope" idea.

> Share the Term -> TermInfo cache across threads
> ---
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage.  You're
> also cutting way back on likelihood of a cache hit (except the known
> multiple times we lookup a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often
> which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap.  One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary).  You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary.  Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781877#action_12781877
 ] 

Robert Muir commented on LUCENE-2034:
-

Simon, in my opinion it is OK to make tokenStream/reusableTokenStream final for 
those non-final contrib analyzers.

i think you should make those non-final analyzers final, too.

then we can get rid of the complexity for sure.


> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, 
> LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt
>
>
> Due to the various tokenStream APIs we had in Lucene, analyzer subclasses 
> need to implement at least one of the methods returning a TokenStream. When 
> you look at the code it appears to be almost identical if both are 
> implemented in the same analyzer.  Each analyzer defines the same inner class 
> (SavedStreams), which is unnecessary.
> In contrib almost every analyzer uses stopwords, and each of them creates its 
> own way of loading them or defines a large number of ctors to load stopwords 
> from a file, set, arrays, etc.; those ctors should be deprecated and 
> eventually removed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

2009-11-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781880#action_12781880
 ] 

Uwe Schindler commented on LUCENE-2075:
---

I would keep it as it is; we already minimized the memory requirements, since 
before this the cache was per-thread.

> Share the Term -> TermInfo cache across threads
> ---
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage.  You're
> also cutting way back on likelihood of a cache hit (except the known
> multiple times we lookup a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often
> which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap.  One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary).  You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary.  Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2093) Use query-private scope instead of shared Term->TermInfo cache

2009-11-24 Thread Michael McCandless (JIRA)
Use query-private scope instead of shared Term->TermInfo cache
--

 Key: LUCENE-2093
 URL: https://issues.apache.org/jira/browse/LUCENE-2093
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Priority: Minor
 Fix For: 3.1


Spinoff of LUCENE-2075.

We currently use a shared terms cache so multiple resolves of the same term 
within execution of a single query save CPU.  But this ties up a good amount of 
long term RAM...

So, it might be better to instead create a "query private scope", where places 
in Lucene like the terms dict could store & retrieve results.  The scope would 
be private to each running query, and would be GCable as soon as the query 
completes.  Then we'd have a perfect within-query hit rate...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
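
Purely to illustrate the shape such a scope might take (nothing like this 
exists in Lucene at this point; QueryScope and its methods are hypothetical 
names):

{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical query-private scope: created when a query starts running,
// handed down to consumers like the terms dict, and simply dropped (made
// GCable) when the query completes.
public class QueryScope {
  private final Map<Object,Object> values = new HashMap<Object,Object>();

  public Object get(Object key) {
    return values.get(key);
  }

  public void put(Object key, Object value) {
    values.put(key, value);
  }
}
{code}

Because the map lives only as long as the query, every repeated term resolve 
within that query is a hit, and no long-term RAM is tied up.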



[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781882#action_12781882
 ] 

Robert Muir commented on LUCENE-2075:
-

i am still trying to figure out the use case.

bq. With Query Scope you mean a whole query, so not only a MTQ? If you combine 
multiple AutomatonQueries in a BooleanQuery it could also profit from the cache 
(as it is currently).

isn't there a method I can use to force these to combine into one 
AutomatonQuery (I can use union, intersection, etc.)?
I haven't done this, but we shouldn't create a private scoped cache for 
something like this, should we?

> Share the Term -> TermInfo cache across threads
> ---
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage.  You're
> also cutting way back on likelihood of a cache hit (except the known
> multiple times we lookup a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often
> which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap.  One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary).  You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary.  Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

2009-11-24 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781884#action_12781884
 ] 

Simon Willnauer commented on LUCENE-2034:
-

bq. i think you should make those non-final analyzers final, too. 
+1

I think the analyzers should always be final. Maybe there are special cases, but 
for most of them nobody should subclass.
It's the same amount of work to make your own anyway.

simon

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, 
> LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt
>
>
> Due to the various tokenStream APIs we had in Lucene, analyzer subclasses 
> need to implement at least one of the methods returning a TokenStream. When 
> you look at the code it appears to be almost identical if both are 
> implemented in the same analyzer.  Each analyzer defines the same inner class 
> (SavedStreams), which is unnecessary.
> In contrib almost every analyzer uses stopwords, and each of them creates its 
> own way of loading them or defines a large number of ctors to load stopwords 
> from a file, set, arrays, etc.; those ctors should be deprecated and 
> eventually removed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

2009-11-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781885#action_12781885
 ] 

Uwe Schindler commented on LUCENE-2075:
---

...not only AutomatonQueries can be combined; they can also be combined with 
other queries and then make use of the cache.

> Share the Term -> TermInfo cache across threads
> ---
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage.  You're
> also cutting way back on likelihood of a cache hit (except the known
> multiple times we lookup a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often
> which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap.  One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary).  You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary.  Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781886#action_12781886
 ] 

Robert Muir commented on LUCENE-2075:
-

Uwe, I just wonder if the cache would get used much in practice.

> Share the Term -> TermInfo cache across threads
> ---
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage.  You're
> also cutting way back on likelihood of a cache hit (except the known
> multiple times we lookup a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often
> which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap.  One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary).  You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary.  Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

2009-11-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781887#action_12781887
 ] 

Uwe Schindler commented on LUCENE-2075:
---

For testing we could add two AtomicIntegers to the cache that count hits and 
requests, to get a hit rate; only temporarily, so as not to affect performance.

> Share the Term -> TermInfo cache across threads
> ---
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage.  You're
> also cutting way back on likelihood of a cache hit (except the known
> multiple times we lookup a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often
> which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap.  One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary).  You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary.  Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
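
A minimal sketch of the temporary instrumentation Uwe describes (CacheStats is 
an illustrative name, not part of any patch):

{code}
import java.util.concurrent.atomic.AtomicInteger;

// Two counters bolted onto the cache for testing only:
// hit rate = hits / requests.
public class CacheStats {
  private final AtomicInteger requests = new AtomicInteger();
  private final AtomicInteger hits = new AtomicInteger();

  // call on every cache lookup
  public void record(boolean hit) {
    requests.incrementAndGet();
    if (hit) {
      hits.incrementAndGet();
    }
  }

  public double hitRate() {
    int r = requests.get();
    return r == 0 ? 0.0 : (double) hits.get() / r;
  }
}
{code}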



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-24 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1606:


Attachment: LUCENE-1606.patch

updated patch:
* don't seek to high surrogates; instead tack on \uDC00 (see the sketch after 
this list). this still works for trunk, but also with the flex branch.
* don't use a high surrogate prefix; instead truncate. this isn't being used at 
all, i would rather use 'constant suffix'.
* add tests that will break if lucene's sort order is not UTF-16 (or if 
automaton is not adjusted to the new sort order).
* add another enum constructor, where you can specify smart or dumb mode 
yourself.
* regexp javadoc note.
* add wordage to LICENSE, not just NOTICE.
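
A sketch of what the first bullet amounts to (seekTerm is a hypothetical 
local; the patch's actual code may differ):

{code}
// If the seek target ends with an unpaired high surrogate, tack on \uDC00
// (the lowest low surrogate) so the target is the smallest complete code
// point rather than a bare high surrogate.
static String fixSeekTerm(String seekTerm) {
  int last = seekTerm.length() - 1;
  if (last >= 0 && Character.isHighSurrogate(seekTerm.charAt(last))) {
    return seekTerm + '\uDC00';
  }
  return seekTerm;
}
{code}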


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
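
To make the quoted description concrete, here is a schematic version of the 
enumeration loop against the pre-flex TermEnum API (the Dfa interface stands 
in for the BRICS automaton, and nextValidString() is assumed to return the 
smallest accepted string >= its argument, or null if none exists; the real 
patch is considerably more refined):

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class AutomatonEnumSketch {

  // Stand-in for the BRICS DFA.
  interface Dfa {
    boolean accepts(String s);
    String nextValidString(String s);
  }

  static void collectMatches(IndexReader reader, String field, Dfa dfa)
      throws IOException {
    String target = dfa.nextValidString("");
    while (target != null) {
      TermEnum terms = reader.terms(new Term(field, target)); // seek
      try {
        Term t = terms.term();
        if (t == null || !t.field().equals(field)) {
          break; // exhausted this field's terms
        }
        if (dfa.accepts(t.text())) {
          // matching term: collect it here, then continue just past it
          target = dfa.nextValidString(t.text() + '\u0000');
        } else {
          // rejected term: jump ahead to the next possible match
          target = dfa.nextValidString(t.text());
        }
      } finally {
        terms.close();
      }
    }
  }
}
{code}

(Re-opening a TermEnum per seek is wasteful; this is only meant to show the 
seek-instead-of-scan control flow.)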



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-24 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1606:


Attachment: (was: LUCENE-1606.patch)

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-24 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1606:


Attachment: LUCENE-1606.patch

sorry, my IDE added an @author tag. i need to look at where to turn this 
@author generation off in Eclipse.

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-11-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781899#action_12781899
 ] 

Michael McCandless commented on LUCENE-1458:


bq. i hate binary order in general to be honest.

But binary order in this case is code point order.

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, 
> UnicodeTestCase.patch, UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPostions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
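
For readers following the sort-order discussion, a small self-contained 
illustration (not from any patch) of where UTF-16 code unit order and code 
point order disagree:

{code}
public class SortOrderDemo {
  public static void main(String[] args) {
    String bmp = "\uFF00";          // U+FF00, in the BMP
    String supp = "\uD800\uDC00";   // U+10000, a supplementary character

    // String.compareTo is UTF-16 code unit order: 0xFF00 > 0xD800,
    // so the BMP char sorts *after* the supplementary char here...
    System.out.println(bmp.compareTo(supp) > 0); // true

    // ...while in code point (and UTF-8 byte) order U+FF00 < U+10000.
    System.out.println(bmp.codePointAt(0) < supp.codePointAt(0)); // true
  }
}
{code}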



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781904#action_12781904
 ] 

Robert Muir commented on LUCENE-1458:
-

Mike, I guess I mean I'd prefer UCA order, which isn't just the order 
codepoints happened to randomly appear on the charts, but is actually designed 
for sorting and ordering things :)

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, 
> UnicodeTestCase.patch, UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPostions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

2009-11-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781908#action_12781908
 ] 

Michael McCandless commented on LUCENE-2075:


bq. I would keep it as it is, because we already minimized memory requirements, 
because before the cache was per-thread.

OK let's leave it at 1024, but with flex (which automaton query no longer needs 
the cache for), I think we should drop it and/or cut over to query-private 
scope.  I don't think sucking up 4 MB of RAM for this rather limited purpose is 
warranted.  I'll add a comment on LUCENE-2093.

> Share the Term -> TermInfo cache across threads
> ---
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage.  You're
> also cutting way back on likelihood of a cache hit (except the known
> multiple times we lookup a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often
> which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap.  One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary).  You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary.  Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2093) Use query-private scope instead of shared Term->TermInfo cache

2009-11-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781910#action_12781910
 ] 

Michael McCandless commented on LUCENE-2093:


If we don't do this in 3.1, we should at least drop the size of the terms dict 
cache -- by rough math, that cache will consume 4 MB on a 20 segment index, 
even for a smallish index.

When flex lands, the cache is no longer beneficial for automaton query so it 
need not be so large.

> Use query-private scope instead of shared Term->TermInfo cache
> --
>
> Key: LUCENE-2093
> URL: https://issues.apache.org/jira/browse/LUCENE-2093
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
>
> Spinoff of LUCENE-2075.
> We currently use a shared terms cache so multiple resolves of the same term 
> within execution of a single query save CPU.  But this ties up a good amount 
> of long term RAM...
> So, it might be better to instead create a "query private scope", where 
> places in Lucene like the terms dict could store & retrieve results.  The 
> scope would be private to each running query, and would be GCable as soon as 
> the query completes.  Then we'd have a perfect within-query hit rate...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
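
One way to arrive at that rough figure, assuming on the order of 200 bytes per 
cached Term -> TermInfo entry (an assumption, not a measured number): 
20 segments x 1024 entries x ~200 bytes/entry = ~4 MB.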



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781911#action_12781911
 ] 

Uwe Schindler commented on LUCENE-1606:
---

what is UTF-38? :-) I think you mean UTF-32, if such exists.

Else it looks good!

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

2009-11-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781913#action_12781913
 ] 

Michael McCandless commented on LUCENE-2075:


bq. Uwe i just wonder if the cache would in practice get used much.

This cache (mapping Term -> TermInfo) does get used a lot for "normal"
atomic queries: we first hit the terms dict to get the docFreq (to
compute idf), then later hit it again with the exact same term, to
get the TermDocs enum.

So, for these queries our hit rate is 50%, but it's rather overkill
to be using a shared cache for this (query-private scope is much
cleaner).  EG a large automaton query running concurrently with other
queries could evict entries before the term is read the 2nd time.

Existing MTQs (except NRQ), which seek once and then scan to completion,
don't hit the cache (though I think they do double-load each term,
which is wasteful; likely this is part of the perf gains for flex).

NRQ doesn't do enough seeking, wrt iterating/collecting the docs, for
the cache to make that much of a difference.

The upcoming automaton query should benefit; however, in testing we
saw only the full-linear-scan benefit, which I'm still needing to get
to the bottom of.


> Share the Term -> TermInfo cache across threads
> ---
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage.  You're
> also cutting way back on likelihood of a cache hit (except the known
> multiple times we lookup a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often
> which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap.  One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary).  You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary.  Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
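
Mike's 50% figure corresponds to the two lookups a plain term query makes; 
schematically (simplified from what the Weight/Scorer machinery actually 
does):

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class TwoLookupsSketch {
  // Both calls resolve the same term against the terms dict: once for
  // docFreq (idf), once to position the TermDocs enum.  With the cache,
  // the second resolve is a hit; hence the 50% hit rate.
  static int countDocs(IndexReader reader, Term term) throws IOException {
    int docFreq = reader.docFreq(term); // terms dict lookup #1 (for idf)
    TermDocs docs = reader.termDocs(term); // terms dict lookup #2
    int count = 0;
    try {
      while (docs.next()) {
        count++;
      }
    } finally {
      docs.close();
    }
    return count; // <= docFreq when the segment has deletions
  }
}
{code}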



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781914#action_12781914
 ] 

Robert Muir commented on LUCENE-1606:
-

i think there is one last problem with this for the flex branch: where you have 
abacadaba\uFFFC, abacadaba\uFFFD and abacadaba\uFFFE in the term dictionary, 
but a regex that matches, say, abacadaba[\uFFFC\uFFFE]. in this case, the match 
on abacadaba\uFFFD will fail; it will try to seek to the "next" string, which 
is abacadaba\uFFFE, but the FFFE will get replaced by FFFD by the byte 
conversion, and we will loop.

mike, i don't think this should be any back compat concern, unlike the high 
surrogate case, which i think many CJK applications are probably doing...


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781914#action_12781914
 ] 

Robert Muir edited comment on LUCENE-1606 at 11/24/09 1:30 PM:
---

i think there is one last problem with this for the flex branch: where you have 
abacadaba\uFFFC, abacadaba\uFFFD and abacadaba\uFFFE in the term dictionary, 
but a regex that matches, say, abacadaba[\uFFFC\uFFFF], the match on 
abacadaba\uFFFD will fail: it will try to seek to the "next" string, which is 
abacadaba\uFFFF, but the FFFF will get replaced by FFFD by the byte conversion, 
and we will loop.

mike i don't think this should be any back compat concern, unlike the high 
surrogate case, which i think many CJK applications are probably hitting...
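
To make the loop concrete, a tiny standalone illustration, under the 
assumption stated above that the byte conversion maps un-encodable values such 
as U+FFFF to U+FFFD:

{code}
String rejected = "abacadaba\uFFFD";          // the term the match failed on
String next     = "abacadaba\uFFFF";          // the computed "next" seek string
String target   = next.replace('\uFFFF', '\uFFFD');
// the seek target collapses back onto the rejected term, so the
// enumeration seeks to the same spot again and never advances:
System.out.println(target.equals(rejected));  // true
{code}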


  was (Author: rcmuir):
i think there is one last problem with this for the flex branch: where you have 
abacadaba\uFFFC, abacadaba\uFFFD and abacadaba\uFFFE in the term dictionary, 
but a regex that matches, say, abacadaba[\uFFFC\uFFFE], the match on 
abacadaba\uFFFD will fail: it will try to seek to the "next" string, which is 
abacadaba\uFFFE, but the FFFE will get replaced by FFFD by the byte conversion, 
and we will loop.

mike i don't think this should be any back compat concern, unlike the high 
surrogate case, which i think many CJK applications are probably hitting...

  
> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> The Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781915#action_12781915
 ] 

Robert Muir commented on LUCENE-1606:
-

Uwe, where do you see UTF-38 :)

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> The Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781916#action_12781916
 ] 

Robert Muir commented on LUCENE-2075:
-

Thanks mike, that's what I was missing:
hitting the terms dict twice in the common case explains it to me :)


> Share the Term -> TermInfo cache across threads
> ---
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage.  You're
> also cutting way back on likelihood of a cache hit (except the known
> multiple times we lookup a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often
> which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap.  One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary).  You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary.  Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?
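
A sketch of that double-barrel idea (hypothetical code, just to make the 
description above concrete; the swap policy is simplified, and a get() racing 
a swap may simply miss):

{code}
import java.util.concurrent.ConcurrentHashMap;

public class DoubleBarrelLRUCache<K, V> {
  private final int maxSize;
  private volatile ConcurrentHashMap<K, V> primary = new ConcurrentHashMap<K, V>();
  private volatile ConcurrentHashMap<K, V> secondary = new ConcurrentHashMap<K, V>();

  public DoubleBarrelLRUCache(int maxSize) { this.maxSize = maxSize; }

  public V get(K key) {
    V value = primary.get(key);
    if (value == null) {
      value = secondary.get(key);
      if (value != null) {
        put(key, value);             // secondary hit: promote to primary
      }
    }
    return value;
  }

  public synchronized void put(K key, V value) {
    primary.put(key, value);
    if (primary.size() > maxSize) {  // primary full: clear secondary
      secondary.clear();             // and swap the two maps; recently
      ConcurrentHashMap<K, V> tmp = primary;  // used entries survive one
      primary = secondary;                    // more generation
      secondary = tmp;
    }
  }
}
{code}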

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-11-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781917#action_12781917
 ] 

Michael McCandless commented on LUCENE-1458:


bq. Mike, I guess I mean i'd prefer UCA order, which isn't just the order 
codepoints happened to randomly appear on charts, but is actually designed for 
sorting and ordering things 

Ahh, gotchya.  Well if we make the sort order pluggable, you could do that...

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, 
> UnicodeTestCase.patch, UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781922#action_12781922
 ] 

Uwe Schindler commented on LUCENE-1606:
---

bq. Uwe, where do you see UTF-38  
Patch line 6025.

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> The Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order

2009-11-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781925#action_12781925
 ] 

Michael McCandless commented on LUCENE-2086:


Backported to 3.0.x...

2.9.x next.

> When resolving deletes, IW should resolve in term sort order
> 
>
> Key: LUCENE-2086
> URL: https://issues.apache.org/jira/browse/LUCENE-2086
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.9.2, 3.0.1, 3.1
>
> Attachments: LUCENE-2086.patch
>
>
> See java-dev thread "IndexWriter.updateDocument performance improvement".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781924#action_12781924
 ] 

Uwe Schindler commented on LUCENE-1606:
---

about the cleanupPrefix method: it is only used in the linear case to initially 
set the termenum. What happens if the nextString() method returns such a string, 
used to seek the next enum?

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> The Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781923#action_12781923
 ] 

Robert Muir commented on LUCENE-1458:
-

bq. Ahh, gotchya. Well if we make the sort order pluggable, you could do that...

yes, then we could consider getting rid of the Collator/Locale-based range 
queries / sorts and things like that completely... which have performance 
problems.
you would have a better way to do it... 

but if you change the sort order, any part of lucene sensitive to it might 
break... maybe it's dangerous.

maybe if we do it, it needs to be exposed properly so other components can 
change their behavior


> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, 
> UnicodeTestCase.patch, UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-

[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781926#action_12781926
 ] 

Robert Muir commented on LUCENE-1606:
-

bq. about the cleanupPrefix method: it is only used in the linear case to 
initially set the termenum. What happens if the nextString() method returns such 
a string, used to seek the next enum? 

look at the code to nextString() itself. 
it uses cleanSeek(), which works differently.

when seeking, we can append \uDC00 to achieve the same thing as seeking to a 
high surrogate.
when using a prefix, we have to truncate the high surrogate, because we cannot 
use it with TermRef.startsWith() etc.; it cannot be converted into UTF-8 bytes. 
(and we can't use the \uDC00 trick there, obviously, or startsWith() would return 
false when it should not)
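
In code, the two cases might look like this (hypothetical helper names, 
following the description above; \uDC00 is the minimum low surrogate, so 
appending it to a trailing high surrogate yields the smallest encodable string 
extending it):

{code}
// seek target: complete the dangling high surrogate with \uDC00 so the
// string becomes a valid pair and can be converted to UTF-8
static String cleanSeekTarget(String s) {
  char last = s.charAt(s.length() - 1);
  return Character.isHighSurrogate(last) ? s + '\uDC00' : s;
}

// prefix: truncate the dangling high surrogate instead, so comparisons
// like TermRef.startsWith() against UTF-8 bytes stay correct
static String cleanPrefix(String s) {
  char last = s.charAt(s.length() - 1);
  return Character.isHighSurrogate(last) ? s.substring(0, s.length() - 1) : s;
}
{code}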

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> The Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-11-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781927#action_12781927
 ] 

Michael McCandless commented on LUCENE-1458:


Yes, this (customizing comparator for termrefs) would definitely be very 
advanced stuff...  you'd have to create your own codec to do it.  And we'd 
default to UTF16 sort order for back compat.

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, 
> UnicodeTestCase.patch, UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781926#action_12781926
 ] 

Robert Muir edited comment on LUCENE-1606 at 11/24/09 1:44 PM:
---

bq. about the cleanupPrefix method: it is only used in the linear case to 
initially set the termenum. What happens if the nextString() method returns such 
a string, used to seek the next enum? 

look at the code to nextString() itself. 
it uses cleanupPosition(), which works differently.

when seeking, we can append \uDC00 to achieve the same thing as seeking to a 
high surrogate.
when using a prefix, we have to truncate the high surrogate, because we cannot 
use it with TermRef.startsWith() etc.; it cannot be converted into UTF-8 bytes. 
(and we can't use the \uDC00 trick there, obviously, or startsWith() would return 
false when it should not)

  was (Author: rcmuir):
bq. about the cleanupPrefix method: it is only used in the linear case to 
initially set the termenum. What happens if the nextString() method returns such 
a string, used to seek the next enum? 

look at the code to nextString() itself. 
it uses cleanSeek(), which works differently.

when seeking, we can append \uDC00 to achieve the same thing as seeking to a 
high surrogate.
when using a prefix, we have to truncate the high surrogate, because we cannot 
use it with TermRef.startsWith() etc.; it cannot be converted into UTF-8 bytes. 
(and we can't use the \uDC00 trick there, obviously, or startsWith() would return 
false when it should not)
  
> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> The Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2086) When resolving deletes, IW should resolve in term sort order

2009-11-24 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-2086.


Resolution: Fixed

OK backported to 2.9.x.

> When resolving deletes, IW should resolve in term sort order
> 
>
> Key: LUCENE-2086
> URL: https://issues.apache.org/jira/browse/LUCENE-2086
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.9.2, 3.0.1, 3.1
>
> Attachments: LUCENE-2086.patch
>
>
> See java-dev thread "IndexWriter.updateDocument performance improvement".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781935#action_12781935
 ] 

Robert Muir commented on LUCENE-1458:
-

bq. Yes, this (customizing comparator for termrefs) would definitely be very 
advanced stuff... you'd have to create your own codec to do it. And we'd 
default to UTF16 sort order for back compat.

Agreed, changing the sort order breaks a lot of things (not just some crazy 
seeking around code that I write)

i.e. if 'ch' is a character in some collator and sorts after b and before c (a 
completely made up example, though there are real ones like this),
then even PrefixQuery itself will fail!

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, 
> UnicodeTestCase.patch, UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-11-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781938#action_12781938
 ] 

Uwe Schindler commented on LUCENE-1458:
---

...not to mention TermRangeQueries and NumericRangeQueries. They rely on 
String.compareTo, like the current terms dict.

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, 
> UnicodeTestCase.patch, UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781935#action_12781935
 ] 

Robert Muir edited comment on LUCENE-1458 at 11/24/09 2:01 PM:
---

bq. Yes, this (customizing comparator for termrefs) would definitely be very 
advanced stuff... you'd have to create your own codec to do it. And we'd 
default to UTF16 sort order for back compat.

Agreed, changing the sort order breaks a lot of things (not just some crazy 
seeking around code that I write)

i.e. if 'ch' is a character in some collator and sorts after b and before c (a 
completely made up example, though there are real ones like this),
then even PrefixQuery itself will fail!

edit: a better example is French collation, where the weights of accent marks are 
evaluated in reverse order. 
a PrefixQuery would then make assumptions based on the prefix which are wrong.

  was (Author: rcmuir):
bq. Yes, this (customizing comparator for termrefs) would definitely be 
very advanced stuff... you'd have to create your own codec to do it. And we'd 
default to UTF16 sort order for back compat.

Agreed, changing the sort order breaks a lot of things (not just some crazy 
seeking around code that I write)

i.e. if 'ch' is a character in some collator and sorts after b and before c (a 
completely made up example, though there are real ones like this),
then even PrefixQuery itself will fail!
  
> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, 
> UnicodeTestCase.patch, UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Created: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Simon Willnauer (JIRA)
Prepare CharArraySet for Unicode 4.0


 Key: LUCENE-2094
 URL: https://issues.apache.org/jira/browse/LUCENE-2094
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.9.1, 2.9, 2.4.1, 2.4, 2.3.2, 2.3.1, 2.3, 2.2, 2.1, 
2.0.0, 1.9, 2.3.3, 2.4.2, 2.9.2, 3.0, 3.0.1, 3.1
Reporter: Simon Willnauer
 Fix For: 3.1


CharArraySet does lowercasing if created with the corresponding flag. This 
means that String / char[] entries with Unicode 4.0 characters which are in the 
set can not be retrieved in "ignorecase" mode.
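
A small standalone demonstration of the underlying problem (not from the 
patch; U+10400, DESERET CAPITAL LETTER LONG I, lowercases to U+10428):

{code}
String s = new String(Character.toChars(0x10400));    // one supplementary character
char[] chars = s.toCharArray();
for (int i = 0; i < chars.length; i++) {
  chars[i] = Character.toLowerCase(chars[i]);         // surrogates pass through unchanged
}
int lower = Character.toLowerCase(s.codePointAt(0));  // 0x10428, the correct mapping
System.out.println(new String(chars).equals(
    new String(Character.toChars(lower))));           // false: per-char lowercasing failed
{code}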


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2094:


Attachment: LUCENE-2094.txt

This patch contains a testcase and a fixed CharArraySet. It does not yet use 
Version, to preserve compatibility. I bring this patch up to start the 
discussion about how we should handle this particular case.
Using Version would not be that much of an issue, as all Analyzers using a 
CharArraySet already have the Version class.


> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that String / char[] entries with Unicode 4.0 characters which are in the 
> set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-11-24 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781947#action_12781947
 ] 

DM Smith commented on LUCENE-1458:
--

bq. Yes, this (customizing comparator for termrefs) would definitely be very 
advanced stuff... you'd have to create your own codec to do it. And we'd 
default to UTF16 sort order for back compat.

For those of us working on texts in all different kinds of languages, it should 
not be very advanced stuff. It should be stock Lucene. A default UCA comparator 
would be good. And a way to provide a locale sensitive UCA comparator would 
also be good.

My use case is that each Lucene index typically has a single language or at 
least has a dominant language.

bq. ...not to mention TermRangeQueries and NumericRangeQueries. They rely on 
String.compareTo, like the current terms dict.
I think that String.compareTo works correctly on UCA collation keys.
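
For reference, a tiny standalone example of a locale-sensitive comparison via 
collation keys, using java.text.Collator (the sample words are arbitrary):

{code}
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

public class CollationKeySketch {
  public static void main(String[] args) {
    Collator collator = Collator.getInstance(Locale.FRENCH);
    CollationKey a = collator.getCollationKey("côte");
    CollationKey b = collator.getCollationKey("coté");
    // CollationKey.compareTo follows the locale's collation rules,
    // while String.compareTo follows raw UTF-16 code unit order:
    System.out.println(a.compareTo(b));            // collation order
    System.out.println("côte".compareTo("coté"));  // code unit order
  }
}
{code}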

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, 
> UnicodeTestCase.patch, UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPostions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781950#action_12781950
 ] 

Robert Muir commented on LUCENE-2094:
-

Hi Simon, at a glance your patch is OK.

I wonder though if we should try to consistently improve both this and the 
LowerCaseFilter patch in the same way.
I have two ideas that might make it easier. I am very inconsistent with 
these things myself, so I guess we can try to make it consistent.

1.
{code}  
   for (int i = 0; i < len; i++) {
     ...
     if (Character.codePointAt(...) >= Character.MIN_SUPPLEMENTARY_CODE_POINT) {
       ++i;
     }
   }
{code}

I wonder if instead loops like this should look like
{code}
 for (int i = 0; i < len; ) {
  ...
  i += Character.charCount(codepoint);
 }
{code}

2. I wonder if we even need an if (supplementary) check for things like 
lowercasing.
toLowerCase(char) and toLowerCase(int) are most likely the same code anyway, 
so we could just make the code easier to read.
{code}
for (int i = 0; i < len; ) {
  i += Character.toChars(
      Character.toLowerCase(Character.codePointAt(arr, i, len)),
      arr, i);
}
{code}
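
For what it's worth, here is a minimal, self-contained sketch of idea 2 (the 
buffer and the sample string are mine, just for illustration; it assumes the 
simple case mapping keeps the same char count, which holds for the characters 
discussed here):
{code}
public class LowerCaseDemo {
  public static void main(String[] args) {
    // "A" followed by DESERET CAPITAL LETTER LONG I (U+10400), a
    // supplementary character stored as the surrogate pair \uD801\uDC00.
    char[] buffer = "A\uD801\uDC00".toCharArray();
    int len = buffer.length;
    for (int i = 0; i < len; ) {
      int cp = Character.toLowerCase(Character.codePointAt(buffer, i, len));
      // toChars writes the (possibly two-char) code point back in place and
      // returns how many chars it wrote, so i steps over surrogate pairs.
      i += Character.toChars(cp, buffer, i);
    }
    System.out.println(new String(buffer, 0, len)); // "a" + U+10428
  }
}
{code}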


> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that a String / char[] with Unicode 4.0 chars which is in the set cannot 
> be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781953#action_12781953
 ] 

Robert Muir commented on LUCENE-1458:
-

bq. I think that String.compareTo works correctly on UCA collation keys.

No, because UCA collation keys are bytes :)
You are right that byte comparison on these keys works though.
But if we change the sort order like this, various components are not looking 
at keys; instead they are looking at the term text itself.

I guess what I am saying is that there are a lot of assumptions in Lucene right 
now (PrefixQuery was my example) that look at term text and assume it is 
sorted in binary order.

bq. It should be stock Lucene
As much as I agree with you that default UCA should be "stock Lucene" (with the 
capability to use an alternate locale or even a tailored collator), this creates 
some practical problems, as mentioned above.
There is also the practical problem that collation in the JDK is poor and we 
would want ICU for good performance...
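
To make the distinction concrete, a small JDK-only sketch (locale and sample 
words chosen arbitrarily) of comparing collation keys versus comparing the raw 
term text:
{code}
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

public class CollationOrderDemo {
  public static void main(String[] args) {
    Collator collator = Collator.getInstance(Locale.FRENCH);
    CollationKey k1 = collator.getCollationKey("côte");
    CollationKey k2 = collator.getCollationKey("coté");
    // Byte-based comparison of the keys gives the locale's order...
    System.out.println("keys:   " + k1.compareTo(k2));
    // ...while String.compareTo on the term text gives UTF-16 code unit
    // order, which is what components like PrefixQuery currently assume.
    System.out.println("String: " + "côte".compareTo("coté"));
  }
}
{code}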


> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, 
> UnicodeTestCase.patch, UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPostions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781959#action_12781959
 ] 

Simon Willnauer commented on LUCENE-2094:
-

Robert, I tried to make it consistent with the LowerCaseFilter issue, but I would 
vote +1 for both! This makes it much cleaner, but we need to change the 
LowerCaseFilter one too!
I will quickly change my patch.

> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that a String / char[] with Unicode 4.0 chars which is in the set cannot 
> be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781958#action_12781958
 ] 

Uwe Schindler commented on LUCENE-2094:
---

Maybe we should put this into UnicodeUtils (handling of toLowerCase etc. for char[]).

> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that a String / char[] with Unicode 4.0 chars which is in the set cannot 
> be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781960#action_12781960
 ] 

Simon Willnauer commented on LUCENE-2094:
-

bq. Maybe we put this into UnicodeUtils (handling of toLowerCase etc for 
char[]). 
I think calling those 3 methods should be fine without a utils method. We will 
see how it goes until the "end" of this whole issue; I might change my mind.

simon

> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that a String / char[] with Unicode 4.0 chars which is in the set cannot 
> be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781962#action_12781962
 ] 

Robert Muir commented on LUCENE-2094:
-

Simon, definitely, it is not a problem with your patch...
I am thinking we can fix both to be clean.

BTW, I have no idea if there is any performance difference when doing things 
this way.


> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that a String / char[] with Unicode 4.0 chars which is in the set cannot 
> be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2094:


Attachment: LUCENE-2094.txt

Changed loop to use Character.charCount().

> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that a String / char[] with Unicode 4.0 chars which is in the set cannot 
> be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781964#action_12781964
 ] 

Simon Willnauer commented on LUCENE-2094:
-

bq. btw, I have no idea if there is any performance difference between doing 
things this way.
The change to charCount is pretty much the same as the if statement - this at 
least would not kill any performance.
The increment by 2 should also not be an issue. It is slightly slower than a ++ 
but this will be fine, I guess.

> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that a String / char[] with Unicode 4.0 chars which is in the set cannot 
> be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781965#action_12781965
 ] 

Robert Muir commented on LUCENE-2094:
-

Simon, yeah,

I guess what I don't know is if in the JDK Character.foo(int) is the same 
underlying stuff as Character.foo(char).
In trunk ICU there are not even char-based methods; it is all int, where it's a 
trie lookup, with a special fast-path array for linear access to Latin-1.


> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that a String / char[] with Unicode 4.0 chars which is in the set cannot 
> be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781969#action_12781969
 ] 

Simon Willnauer commented on LUCENE-2094:
-

bq. I guess what I don't know, is if in the JDK Character.foo(int) is the same 
underlying stuff as Character.foo(char)
The JDK version of toLowerCase(char) for instance casts to int and calls the 
overloaded method.
public static boolean isLowerCase(char ch) {
return isLowerCase((int)ch);
}

That is the case all over the place as far as I can see.

> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that a String / char[] with Unicode 4.0 chars which is in the set cannot 
> be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781969#action_12781969
 ] 

Simon Willnauer edited comment on LUCENE-2094 at 11/24/09 3:00 PM:
---

bq. I guess what I don't know, is if in the JDK Character.foo(int) is the same 
underlying stuff as Character.foo(char)
The JDK version of toLowerCase(char) for instance casts to int and calls the 
overloaded method.
{code}
public static boolean isLowerCase(char ch) {
return isLowerCase((int)ch);
}
{code}

That is the case all over the place as far as I can see.

  was (Author: simonw):
bq. I guess what I don't know, is if in the JDK Character.foo(int) is the 
same underlying stuff as Character.foo(char)
The JDK version of toLowerCase(char) for instance casts to int and calls the 
overloaded method.
public static boolean isLowerCase(char ch) {
return isLowerCase((int)ch);
}

That is the case all over the place as far as I can see.
  
> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that a String / char[] with Unicode 4.0 chars which is in the set cannot 
> be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781971#action_12781971
 ] 

Robert Muir commented on LUCENE-2094:
-

Simon, yeah, I just checked.
All the properties behind the scenes are stored as int.
We shouldn't use any char-based methods pretending it will buy us any faster 
performance.
It will just make the code ugly and probably slower.

> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that a String / char[] with Unicode 4.0 chars which is in the set cannot 
> be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781971#action_12781971
 ] 

Robert Muir edited comment on LUCENE-2094 at 11/24/09 3:09 PM:
---

Simon, yeah, I just checked.
All the properties behind the scenes are stored as int.
We shouldn't use any char-based methods pretending it will buy us any faster 
performance.
It will just make the code ugly and probably slower.

Slower meaning: the "if" itself in the LowerCaseFilter patch can now be 
removed.


  was (Author: rcmuir):
Simon, yeah, I just checked.
All the properties behind the scenes are stored as int.
We shouldn't use any char-based methods pretending it will buy us any faster 
performance.
It will just make the code ugly and probably slower.
  
> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that a String / char[] with Unicode 4.0 chars which is in the set cannot 
> be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2094:


Attachment: LUCENE-2094.txt

Added some more tests, including single high-surrogate chars.

> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that a String / char[] with Unicode 4.0 chars which is in the set cannot 
> be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781980#action_12781980
 ] 

Simon Willnauer commented on LUCENE-2094:
-

question of the day - should we use Version or not :)



> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that a String / char[] with Unicode 4.0 chars which is in the set cannot 
> be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781983#action_12781983
 ] 

Uwe Schindler commented on LUCENE-2094:
---

It would not hurt; the Set is only used by analyzers, which all take a Version 
param... It is not really a public API.

> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that a String / char[] with Unicode 4.0 chars which is in the set cannot 
> be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2092) BooleanQuery.hashCode and equals ignore isCoordDisabled

2009-11-24 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-2092.


Resolution: Fixed

Fixed in trunk, 3.0.x branch, 2.9.x branch.  Thanks Hoss!

> BooleanQuery.hashCode and equals ignore isCoordDisabled
> ---
>
> Key: LUCENE-2092
> URL: https://issues.apache.org/jira/browse/LUCENE-2092
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Query/Scoring
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1, 
> 2.9, 2.9.1
>Reporter: Hoss Man
>Assignee: Michael McCandless
> Fix For: 2.9.2, 3.0.1, 3.1
>
> Attachments: LUCENE-2092.patch
>
>
> BooleanQuery.isCoordDisabled() is not considered by BooleanQuery's hashCode() 
> or equals() methods ... this can cause serious badness to happen when caching 
> BooleanQueries.
> bug traces back to at least 1.9

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781994#action_12781994
 ] 

Simon Willnauer commented on LUCENE-2094:
-

bq. It would not hurt, the Set is only used for analyzers that all take a 
version param... It is not really a public API. 
So the thing here is that lowercasing for supplementary characters only 
applies to a handful of chars; see this link: 
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3ACase_Sensitive%3DTrue%3A]%26[^[\u-\u]]]&esc=on
Those characters are from the Deseret Alphabet (Mormons), which means we would be 
introducing a "pain in the neck" Version flag into CharArraySet for about 40 
chars which would be broken?! I don't see this here! Nothing personal related 
to the Deseret Alphabet or anyone who is using it, but this seems a bit too much 
of a hassle. It would also make the code very ugly.

simon
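
A throwaway sketch to verify the claim, counting the supplementary code points 
that Character.toLowerCase actually changes (on a Unicode 4.0-era JDK this 
should count just the ~40 Deseret capitals):
{code}
public class SupplementaryCaseCount {
  public static void main(String[] args) {
    int count = 0;
    for (int cp = Character.MIN_SUPPLEMENTARY_CODE_POINT;
         cp <= Character.MAX_CODE_POINT; cp++) {
      // Count code points whose simple lowercase mapping differs.
      if (Character.toLowerCase(cp) != cp) {
        count++;
      }
    }
    System.out.println(count + " supplementary code points lowercase differently");
  }
}
{code}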

 

> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that a String / char[] with Unicode 4.0 chars which is in the set cannot 
> be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781995#action_12781995
 ] 

Robert Muir commented on LUCENE-2094:
-

Another option would be to list a back-compat break in CHANGES:

if you are indexing the Deseret language, you should reindex.

We could remove the Version from LowerCaseFilter this way, too.
If you are indexing this language, things weren't working right before, so you 
surely wrote your own filters...?!

> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that a String / char[] with Unicode 4.0 chars which is in the set cannot 
> be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781998#action_12781998
 ] 

Simon Willnauer commented on LUCENE-2094:
-

I would also break compat in LowerCaseFilter and put out a large NOTE that if 
you index Mormon texts you need to reindex.

> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that a String / char[] with Unicode 4.0 chars which is in the set cannot 
> be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782007#action_12782007
 ] 

Uwe Schindler commented on LUCENE-2094:
---

+1 for breaking backwards compatibility for these chars. From the web: there are 
only 4 books written in this script (the books of Mormon, see 
[http://en.wikipedia.org/wiki/Deseret_alphabet], 
[http://www.omniglot.com/writing/deseret.htm]), so it is rather rare. People 
affected by this will for sure have their own analyzers.

> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that a String / char[] with Unicode 4.0 chars which is in the set cannot 
> be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782009#action_12782009
 ] 

Robert Muir commented on LUCENE-2094:
-

Simon, yeah. It's tricky, you know, like many suppl. char issues.

Even if we provide perfect backwards compatibility with what 3.0 did, if you 
care about these languages, you *WANT* to reindex, because stuff wasn't working 
at all before.
And if you really care, you weren't using any of Lucene's analysis components 
anyway (except maybe WhitespaceTokenizer).
For example, StandardAnalyzer currently discards these characters anyway.

But we don't want to screw over CJK users where things might have been "mostly" 
working before, either.
In this case, CJK is completely unaffected, so I think we should not use Version 
here or in any other lowercasing fixes, including LowerCaseFilter itself.


> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. This 
> means that a String / char[] with Unicode 4.0 chars which is in the set cannot 
> be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

2009-11-24 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer reassigned LUCENE-2039:
---

Assignee: Simon Willnauer  (was: Grant Ingersoll)

Took over from Grant 

> Regex support and beyond in JavaCC QueryParser
> --
>
> Key: LUCENE-2039
> URL: https://issues.apache.org/jira/browse/LUCENE-2039
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2039.patch, LUCENE-2039_field_ext.patch, 
> LUCENE-2039_field_ext.patch, LUCENE-2039_field_ext.patch
>
>
> Since the early days the standard query parser was limited to the queries 
> living in core; adding other queries or extending the parser in any way 
> always forced people to change the grammar file and regenerate. Even if you 
> change the grammar you have to be extremely careful how you modify the parser 
> so that other parts of the standard parser are not affected by customisation 
> changes. Eventually you had to live with all the limitations the current 
> parser has, like tokenizing on whitespace before a tokenizer / analyzer has 
> the chance to look at the tokens. 
> I was thinking about how to overcome the limitation and add regex support to 
> the query parser without introducing any dependency on core. I added a new 
> special character that basically prevents the parser from interpreting any of 
> the characters enclosed in the new special characters. I chose the forward 
> slash '/' as the delimiter, so that everything in between two forward slashes 
> is basically escaped and ignored by the parser. All chars embedded within 
> forward slashes are treated as one token even if it contains other special 
> chars like * []?{} or whitespaces. This token is subsequently passed to a 
> pluggable "parser extension" which builds a query from the embedded string. I 
> do not interpret the embedded string in any way but leave all the subsequent 
> work to the parser extension. Such an extension could be another full 
> featured query parser itself or simply a ctor call for a regex query. The 
> interface remains quite simple but makes the parser extendible in an easy way 
> compared to modifying the JavaCC sources.
> The downside of this patch is clearly that I introduce a new special char 
> into the syntax, but I guess that would not be that much of a deal as it is 
> reflected in the escape method. It would truly be nice to have more than one 
> extension and have this even more flexible, so treat this patch as a kickoff.
> Another way of solving the problem with RegexQuery would be to move the JDK 
> version of regex into the core and simply have another method like:
> {code}
> protected Query newRegexQuery(Term t) {
>   ... 
> }
> {code}
> which I would like better as it would be more consistent with the idea of the 
> query parser being a very strict and defined parser.
> I will upload a patch in a second which implements the extension-based 
> approach. I guess I will add a second patch with regex in core soon too.
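> To illustrate, here is a hedged sketch of the extension idea (the interface 
> name and method signature are illustrative guesses, not the actual API of the 
> attached patch):
> {code}
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.regex.RegexQuery; // contrib/regex
> 
> /** Hypothetical hook: receives the raw text found between two slashes. */
> interface ParserExtension {
>   Query parseExtended(String field, String embedded);
> }
> 
> /** Example extension that interprets /.../ as a contrib RegexQuery. */
> class RegexExtension implements ParserExtension {
>   public Query parseExtended(String field, String embedded) {
>     return new RegexQuery(new Term(field, embedded));
>   }
> }
> {code}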

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782015#action_12782015
 ] 

Robert Muir commented on LUCENE-1458:
-

{quote}
So this is definitely a back compat problem. And, unfortunately, even
if we like the true codepoint sort order, it's not easy to switch to
in a back-compat manner because if we write new segments into an old
index, SegmentMerger will be in big trouble when it tries to merge two
segments that had sorted the terms differently.
{quote}

Mike, I think it goes well beyond this. 
I think sort order is an exceptional low-level case that can trickle all the 
way up high into the application layer (including user perception itself), and 
create bugs.
Does a non-technical user in Hong Kong know how many codepoints each ideograph 
they enter are? 
Should they care? They will just not understand if things are in different 
order.

I think we are stuck with UTF-16 without a huge effort, which would not be 
worth it in any case.


> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, 
> UnicodeTestCase.patch, UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPostions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload

[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782015#action_12782015
 ] 

Robert Muir edited comment on LUCENE-1458 at 11/24/09 4:37 PM:
---

{quote}
So this is definitely a back compat problem. And, unfortunately, even
if we like the true codepoint sort order, it's not easy to switch to
in a back-compat manner because if we write new segments into an old
index, SegmentMerger will be in big trouble when it tries to merge two
segments that had sorted the terms differently.
{quote}

Mike, I think it goes well beyond this. 
I think sort order is an exceptional low-level case that can trickle all the 
way up high into the application layer (including user perception itself), and 
create bugs.
Does a non-technical user in Hong Kong know how many code units each ideograph 
they enter are? 
Should they care? They will just not understand if things are in different 
order.

I think we are stuck with UTF-16 without a huge effort, which would not be 
worth it in any case.


  was (Author: rcmuir):
{quote}
So this is definitely a back compat problem. And, unfortunately, even
if we like the true codepoint sort order, it's not easy to switch to
in a back-compat manner because if we write new segments into an old
index, SegmentMerger will be in big trouble when it tries to merge two
segments that had sorted the terms differently.
{quote}

Mike, I think it goes well beyond this. 
I think sort order is an exceptional low-level case that can trickle all the 
way up high into the application layer (including user perception itself), and 
create bugs.
Does a non-technical user in Hong Kong know how many codepoints each ideograph 
they enter are? 
Should they care? They will just not understand if things are in different 
order.

I think we are stuck with UTF-16 without a huge effort, which would not be 
worth it in any case.

  
> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, 
> UnicodeTestCase.patch, UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPostions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}

[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782024#action_12782024
 ] 

Robert Muir commented on LUCENE-1606:
-

bq. Patch line 6025.

Thanks for reviewing the patch and catching this. I'm working on trying to 
finalize it.
It already works fine for trunk, but I don't want it to suddenly break with the 
flex branch, so I'm adding a lot of tests and improvements in that regard.
The current wildcard tests aren't sufficient anyway to tell if it's really 
working.
Also, when Mike ported it to the flex branch, he reorganized some code in 
a way that I think is better, so I want to tie that in too.


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> The Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.
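> For contrast, a hedged sketch of the naive linear scan this improves on (the 
> field name and regex are arbitrary; the BRICS calls are real, the loop is 
> illustrative):
> {code}
> import dk.brics.automaton.RegExp;
> import dk.brics.automaton.RunAutomaton;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.index.TermEnum;
> 
> class NaiveRegexScan {
>   // Binary accept/reject over every term: no seeking past dead prefixes.
>   static int countMatches(IndexReader reader, String field) throws Exception {
>     RunAutomaton matcher =
>         new RunAutomaton(new RegExp("(http|ftp)://.*").toAutomaton());
>     TermEnum te = reader.terms(new Term(field, ""));
>     int hits = 0;
>     try {
>       do {
>         Term t = te.term();
>         if (t == null || !field.equals(t.field())) break;
>         if (matcher.run(t.text())) hits++;  // accept
>       } while (te.next());
>     } finally {
>       te.close();
>     }
>     return hits;
>   }
> }
> {code}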

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782026#action_12782026
 ] 

Uwe Schindler commented on LUCENE-1606:
---

Did he change the FilteredTermEnum.next() loops? If yes, maybe the better 
approach also works for NRQ. I am just interested, but have had no time to 
thoroughly look into the latest changes.

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782026#action_12782026
 ] 

Uwe Schindler edited comment on LUCENE-1606 at 11/24/09 5:09 PM:
-

Did he change the FilteredTermEnum.next() loops? If yes, maybe the better 
approach also works for NRQ. I am just interested, but have had no time to 
look thoroughly into the latest changes.

I am still thinking about an extension of FilteredTermEnum that works with 
this repositioning out of the box, but I have no good idea yet. The work in 
FilteredTerm*s*Enum is a good start, but it could be extended to also support 
something like a return value "JUMP_TO_NEXT_ENUM" and an abstract method 
"nextEnum()" that returns null by default (no further enum).

  was (Author: thetaphi):
Did he change the FilteredTermEnum.next() loops? If yes, maybe the better 
approach also works for NRQ. I am just interested, but have had no time to 
look thoroughly into the latest changes.
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782029#action_12782029
 ] 

Robert Muir commented on LUCENE-1606:
-

No, the main thing he did here that I like better is that, instead of caching 
the last comparison in termCompare(), he uses a boolean 'first'.

This still gives the optimization of "don't seek in the term dictionary unless 
you get a mismatch; as long as you have matches, read sequentially".
But in my opinion, it's cleaner.
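
A minimal sketch of that 'first' flag pattern, reusing the hypothetical
TermSource/Dfa stand-ins from the sketch earlier in this thread (again, not
the actual patch code):

  final class FirstFlagScan {
    private boolean first = true;

    // Seek happens only on the very first call and after a mismatch;
    // runs of matching terms are read sequentially, the cheap path.
    String nextMatch(TermSource terms, Dfa dfa, String start) {
      String t = first ? terms.seekCeil(start) : terms.next();
      first = false;
      while (t != null && !dfa.accepts(t)) {
        String target = dfa.nextAcceptedAfter(t);
        t = (target == null) ? null : terms.seekCeil(target);
      }
      return t; // the next accepted term, or null when exhausted
    }
  }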



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782033#action_12782033
 ] 

Uwe Schindler commented on LUCENE-1606:
---

OK, so it doesn't affect NRQ, as it uses a different algorithm.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

2009-11-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782035#action_12782035
 ] 

Michael McCandless commented on LUCENE-2075:


{quote}
bq. I am quite sure that also Robert's test is random (as he explained).

It's not random - it's the specified pattern, parsed to
WildcardQuery, run 10 times, then take best or avg time.
{quote}

Woops -- I was wrong here -- Robert's test is random: on each iteration, it 
replaces any N's in the pattern w/ a random number 0-9.

Still baffled on why the linear scan shows gains w/ the cache... digging.

> Share the Term -> TermInfo cache across threads
> ---
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, 
> LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread-private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if a high number of threads come through
> Lucene, you're multiplying the RAM usage.  You're also cutting way back on
> the likelihood of a cache hit (except for the known multiple times we look
> up a term within a query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often, which
> each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, e.g.
> ConcurrentHashMap.  One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary).  You check the cache by
> first checking primary; if that's a miss, you check secondary, and if
> you get a hit you promote it to primary.  Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?
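
A minimal sketch of that double-barrel idea, assuming a simplified shape
(not Lucene's eventual class):

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  final class DoubleBarrelLRUCache<K, V> {
    private final int maxSize;
    private volatile Map<K, V> primary = new ConcurrentHashMap<K, V>();
    private volatile Map<K, V> secondary = new ConcurrentHashMap<K, V>();

    DoubleBarrelLRUCache(int maxSize) { this.maxSize = maxSize; }

    V get(K key) {
      V v = primary.get(key);
      if (v == null) {
        v = secondary.get(key);
        if (v != null) primary.put(key, v); // hit in secondary: promote
      }
      return v;
    }

    void put(K key, V value) {
      primary.put(key, value);
      if (primary.size() > maxSize) {
        // primary is full: drop the old secondary and swap the barrels
        // (a fresh map instead of clear(), equivalent in effect)
        secondary = primary;
        primary = new ConcurrentHashMap<K, V>();
      }
    }
  }

The appeal of the design is that it approximates LRU with no per-entry
bookkeeping: anything touched recently gets promoted back into primary, and
everything untouched for a full barrel generation is discarded at the swap.
The swap itself is not atomic, so a racing get() may miss; for a cache that
only costs an extra lookup.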

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782036#action_12782036
 ] 

Robert Muir commented on LUCENE-1606:
-

Yeah, but in general I think I already agree that FilteredTerm*s*Enum is easier 
for stuff like this.

Either way, it's still tricky to make enums like this, so I am glad you are 
looking into it.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782039#action_12782039
 ] 

Robert Muir commented on LUCENE-2075:
-

bq. Woops - I was wrong here - Robert's test is random: on each iteration, it 
replaces any N's in the pattern w/ a random number 0-9.

Yeah, the terms are equally distributed 000-999 though, just a "fill".
The wildcard patterns themselves are filled with random numbers.

This is my basis for the new wildcard test, btw, except with maybe 1-10k 
terms; definitely want over 8192 :)
Unless you have better ideas?
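
For concreteness, a small sketch of that setup; the helper names and exact
shape are assumptions, not the BenchWildcard.java attachment:

  import java.util.Locale;
  import java.util.Random;

  final class WildcardBenchSetup {
    // the "fill": terms are just zero-padded numbers "000" .. "999"
    static String[] fillTerms() {
      String[] terms = new String[1000];
      for (int i = 0; i < terms.length; i++) {
        terms[i] = String.format(Locale.ROOT, "%03d", i);
      }
      return terms;
    }

    // each iteration replaces every 'N' in the pattern with a random digit
    static String fillPattern(String pattern, Random random) {
      StringBuilder sb = new StringBuilder(pattern.length());
      for (char c : pattern.toCharArray()) {
        sb.append(c == 'N' ? (char) ('0' + random.nextInt(10)) : c);
      }
      return sb.toString();
    }
  }

So fillPattern("N?N*N", random) yields patterns like "3?7*1", giving a
different but equally shaped query on every run.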


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782042#action_12782042
 ] 

Uwe Schindler commented on LUCENE-1606:
---

I think the approach with nextEnum() would work for both Automaton and NRQ, 
because both use this style of iteration. You have nextString() for 
repositioning, and I have a LinkedList (a stack) of pre-sorted range bounds.
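
A speculative sketch of how such a stack of range bounds could drive the
nextEnum() idea; all names here are assumptions, not NRQ's actual internals:

  import java.util.Iterator;
  import java.util.LinkedList;

  abstract class RangeDrivenEnum {
    // pre-sorted {lower, upper} term bounds, consumed front to back
    protected final LinkedList<String[]> ranges = new LinkedList<String[]>();

    // opens an enum over all terms in [lower, upper]; supplied by the index
    protected abstract Iterator<String> termsInRange(String lower, String upper);

    // the nextEnum() idea: null once the stack of ranges is exhausted
    protected Iterator<String> nextEnum() {
      if (ranges.isEmpty()) return null;
      String[] r = ranges.removeFirst();
      return termsInRange(r[0], r[1]);
    }
  }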


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782043#action_12782043
 ] 

Robert Muir commented on LUCENE-1606:
-

And I could still use this with "dumb mode", just one enum, right?


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Hudson Account for me refused

2009-11-24 Thread Uwe Schindler
Hi all,

I was trying to get an account for Hudson, but it was refused:
https://issues.apache.org/jira/browse/INFRA-2326

As far as I know, other committers of Lucene-Java who are not PMC members
already have one, so what should I do?

Could somebody with a Hudson account at least change the build properties to
use version 3.1-dev? The nightly svn target was already changed.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

2009-11-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782046#action_12782046
 ] 

Michael McCandless commented on LUCENE-2075:


bq. This is my basis for the new wildcard test, btw, except with maybe 1-10k 
terms; definitely want over 8192

Sounds great :)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782056#action_12782056
 ] 

Uwe Schindler commented on LUCENE-1606:
---

yes.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)

2009-11-24 Thread Andi Vajda


 Hi Uwe,

On Sun, 22 Nov 2009, Uwe Schindler wrote:


I have built the artifacts for the final release of "Apache Lucene Java
3.0.0" a second time, because of a bug in the TokenStream API (found by Shai
Erera, who wanted to make "bad" things with addAttribute, breaking its
behaviour, LUCENE-2088) and an improvement in NumericRangeQuery (to prevent
stack overflow, LUCENE-2087). They are targeted for release on 2009-11-25.

The artifacts are here:
http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-take2/


The artifacts you've prepared don't correspond to the HEAD of the 
lucene_3_0 branch anymore since fixes for bugs 2086 and 2092 were added.


Could you please add a lucene_3_0_0 tag that corresponds to the artifacts? 
This makes it easier to build a PyLucene with Lucene Java sources equivalent 
to these artifacts, using Lucene Java's svn.


Of course, if another revision of these artifacts ends up being made, the 
tag should then move accordingly but, at this point, it's just missing.


Thanks !

Andi..



You find the changes in the corresponding sub folder. The SVN revision is
883080, here the manifest with build system info:

Manifest-Version: 1.0
Ant-Version: Apache Ant 1.7.0
Created-By: 1.5.0_22-b03 (Sun Microsystems Inc.)
Specification-Title: Lucene Search Engine
Specification-Version: 3.0.0
Specification-Vendor: The Apache Software Foundation
Implementation-Title: org.apache.lucene
Implementation-Version: 3.0.0 883080 - 2009-11-22 15:52:49
Implementation-Vendor: The Apache Software Foundation
X-Compile-Source-JDK: 1.5
X-Compile-Target-JDK: 1.5

Please vote to officially release these artifacts as "Apache Lucene Java
3.0.0".

We need at least 3 binding (PMC) votes.

Thanks everyone for all their hard work on this and I am very sorry for
requesting a vote again, but that's life! Thanks Shai for the pointer to the
bug!




Here is the proposed release note, please edit, if needed:
--

Hello Lucene users,

On behalf of the Lucene dev community (a growing community far larger than
just the committers) I would like to announce the release of Lucene Java
3.0:

The new version is mostly a cleanup release without any new features. All
deprecations targeted to be removed in version 3.0 were removed. If you are
upgrading from version 2.9.1 of Lucene, you have to fix all deprecation
warnings in your code base to be able to recompile against this version.

This is the first Lucene release with Java 5 as a minimum requirement. The
API was cleaned up to make use of Java 5's generics, varargs, enums, and
autoboxing. New users of Lucene are advised to use this version for new
developments, because it has a clean, type safe new API. Upgrading users can
now remove unnecessary casts and add generics to their code, too. If you
have not upgraded your installation to Java 5, please read the file
JRE_VERSION_MIGRATION.txt (please note that this is not related to Lucene
3.0, it will also happen with any previous release when you upgrade your
Java environment).

Lucene 3.0 has some changes regarding compressed fields: 2.9 already
deprecated compressed fields; support for them was removed now. Lucene 3.0
is still able to read indexes with compressed fields, but as soon as merges
occur or the index is optimized, all compressed fields are decompressed and
converted to Field.Store.YES. Because of this, indexes with compressed
fields can suddenly get larger.

While we generally try and maintain full backwards compatibility between
major versions, Lucene 3.0 has some minor breaks, mostly related to
deprecation removal, pointed out in the 'Changes in backwards compatibility
policy' section of CHANGES.txt. Notable are:

- IndexReader.open(Directory) now opens in read-only mode by default (this
method was deprecated because of that in 2.9). The same applies to
IndexSearcher.

- Already started in 2.9, core TokenStreams are now made final to enforce
the decorator pattern.

- If you interrupt an IndexWriter merge thread, IndexWriter now throws an
unchecked ThreadInterruptedException that extends RuntimeException and
clears the interrupt status.

--



Thanks,
Uwe


-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: Hudson Account for me refused

2009-11-24 Thread Uwe Schindler
Here is the rationale for that:

http://mail-archives.apache.org/mod_mbox/www-builds/200911.mbox/%3c4B0C1563.
6050...@apache.org%3e


-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Hudson Account for me refused

2009-11-24 Thread Mark Miller
Yeah, I've seen these rejections before - I don't think the rule makes
any sense, but they only give Hudson accounts to PMC members.

Uwe Schindler wrote:
> Here is the rationale for that:
>
> http://mail-archives.apache.org/mod_mbox/www-builds/200911.mbox/%3c4B0C1563.
> 6050...@apache.org%3e


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)

2009-11-24 Thread Uwe Schindler
Hi Andi

I will add the tag when it is officially voted for release. If we respin,
the tag would be incorrect (and would have to be removed and recreated). The
release todo clearly says that the tag should be added when all votes are
in, and all others have done it like this before.

Just one more day and I will create the tag (if I get 2 more votes).

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Andi Vajda [mailto:va...@osafoundation.org]
> Sent: Tuesday, November 24, 2009 6:46 PM
> To: java-dev@lucene.apache.org
> Subject: Re: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)
> 
> 
>   Hi Uwe,
> 
> On Sun, 22 Nov 2009, Uwe Schindler wrote:
> 
> > I have built the artifacts for the final release of "Apache Lucene Java
> > 3.0.0" a second time, because of a bug in the TokenStream API (found by
> > Shai Erera, who wanted to make "bad" things with addAttribute, breaking
> > its behaviour, LUCENE-2088) and an improvement in NumericRangeQuery (to
> > prevent stack overflow, LUCENE-2087). They are targeted for release on
> > 2009-11-25.
> >
> > The artifacts are here:
> > http://people.apache.org/~uschindler/staging-area/lucene-3.0.0-take2/
> 
> The artifacts you've prepared don't correspond to the HEAD of the
> lucene_3_0 branch anymore since fixes for bugs 2086 and 2092 were added.
> 
> Could you please add a lucene_3_0_0 tag that corresponds to the artifacts?
> This makes it easier to build a PyLucene with Lucene Java sources
> equivalent to these artifacts, using Lucene Java's svn.
> 
> Of course, if another revision of these artifacts ends up being made, the
> tag should then move accordingly but, at this point, it's just missing.
> 
> Thanks !
> 
> Andi..
> 

RE: [VOTE] Release Apache Lucene Java 3.0.0 (take #2)

2009-11-24 Thread Andi Vajda


On Tue, 24 Nov 2009, Uwe Schindler wrote:


I will add the tag when it is officially voted for release. If we respin,
the tag would be incorrect (and would have to be removed and recreated). The
release todo clearly says that the tag should be added when all votes are
in, and all others have done it like this before.

Just one more day and I will create the tag (if I get 2 more votes).


So I'm in a catch-22. I was going to vote if I could build a PyLucene from 
this and pass all PyLucene tests :)


Do you happen to know what svn rev was used to build the artifacts?
I could use that rev instead of HEAD.

Andi..



  1   2   >