[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783584#action_12783584
 ] 

Uwe Schindler commented on LUCENE-1458:
---

I rewrote the NumericRangeTermsEnum, see revision 885360.

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_rotate.patch, LUCENE-1458_sortorder_bwcompat.patch, 
> LUCENE-1458_termenum_bwcompat.patch, UnicodeTestCase.patch, 
> UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.
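The seek-point idea in the first bullet above (absolute offsets at seek points, deltas everywhere else) can be illustrated with a minimal sketch. This is not the tii/tis file format from the patch; the class name, the `interval` parameter and the in-memory `long[]` layout are assumptions made only for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration: offsets are stored as deltas, except at every
// `interval`-th entry (a "seek point"), where the absolute value is stored.
class SeekPointOffsets {
    /** Encodes sorted file offsets: absolute at seek points, delta otherwise. */
    static long[] encode(long[] offsets, int interval) {
        long[] out = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            out[i] = (i % interval == 0) ? offsets[i] : offsets[i] - offsets[i - 1];
        }
        return out;
    }

    /** Decodes entry `target` starting from the nearest preceding seek point. */
    static long decodeAt(long[] encoded, int interval, int target) {
        int seekPoint = (target / interval) * interval;
        long value = encoded[seekPoint]; // absolute, no earlier state needed
        for (int i = seekPoint + 1; i <= target; i++) {
            value += encoded[i]; // apply deltas up to the target entry
        }
        return value;
    }

    public static void main(String[] args) {
        long[] offsets = {100, 130, 170, 260, 300, 345, 400};
        long[] encoded = encode(offsets, 3);
        System.out.println(Arrays.toString(encoded));
        System.out.println(decodeAt(encoded, 3, 5));
    }
}
```

Because a seek point carries an absolute value, a reader that jumps there does not need any state from earlier entries, at the cost of a few extra bytes per seek point.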

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

2009-11-30 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783588#action_12783588
 ] 

Simon Willnauer commented on LUCENE-2039:
-

The contrib/regex dependency on contrib/misc bugs me a bit though. I have the 
impression that this regex default extension should not be part of this patch. 
The extension seems to be so trivial that users could implement it on their 
own. This would save us the dependency and IMO would not be a problem for users 
though.

 Any thoughts?

> Regex support and beyond in JavaCC QueryParser
> --
>
> Key: LUCENE-2039
> URL: https://issues.apache.org/jira/browse/LUCENE-2039
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2039.patch, LUCENE-2039_field_ext.patch, 
> LUCENE-2039_field_ext.patch, LUCENE-2039_field_ext.patch, 
> LUCENE-2039_field_ext.patch
>
>
> Since the early days the standard query parser was limited to the queries 
> living in core, adding other queries or extending the parser in any way 
> always forced people to change the grammar file and regenerate. Even if you 
> change the grammar you have to be extremely careful how you modify the parser 
> so that other parts of the standard parser are not affected by customisation 
> changes. Eventually you had to live with all the limitations the current 
> parser has, like tokenizing on whitespace before a tokenizer / analyzer has 
> the chance to look at the tokens. 
> I was thinking about how to overcome the limitation and add regex support to 
> the query parser without introducing any dependency to core. I added a new 
> special character that basically prevents the parser from interpreting any of 
> the characters enclosed in the new special characters. I chose the forward 
> slash  '/' as the delimiter so that everything in between two forward slashes 
> is basically escaped and ignored by the parser. All chars embedded within 
> forward slashes are treated as one token even if it contains other special 
> chars like * []?{} or whitespaces. This token is subsequently passed to a 
> pluggable "parser extension" which builds a query from the embedded string. I 
> do not interpret the embedded string in any way but leave all the subsequent 
> work to the parser extension. Such an extension could be another full 
> featured query parser itself or simply a ctor call for regex query. The 
> interface remains quite simple but makes the parser extensible in an easy way 
> compared to modifying the javaCC sources.
> The downside of this patch is clearly that I introduce a new special char 
> into the syntax but I guess that would not be that much of a deal as it is 
> reflected in the escape method though. It would truly be nice to have more 
> than one extension and have this even more flexible, so treat this patch as a 
> kickoff though.
> Another way of solving the problem with RegexQuery would be to move the JDK 
> version of regex into the core and simply have another method like:
> {code}
> protected Query newRegexQuery(Term t) {
>   ... 
> }
> {code}
> which I would like better as it would be more consistent with the idea of the 
> query parser to be a very strict and defined parser.
> I will upload a patch in a second which implements the extension based 
> approach. I guess I will add a second patch with regex in core soon too.
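The slash-delimited extension idea described above can be sketched without any Lucene dependency. Everything here is hypothetical: `ParserExtension`, `SlashTokenizer` and their methods are illustration names, not the API in the attached patches, and the hand-written loop stands in for what the JavaCC grammar would do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical extension point: builds a query object from the raw embedded string. */
interface ParserExtension<Q> {
    Q parse(String field, String raw);
}

class SlashTokenizer {
    /** Splits on whitespace, but keeps everything between two '/' as one token. */
    static List<String> tokenize(String query) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideSlashes = false;
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '\\' && i + 1 < query.length()) {
                // Escaped character: copy both chars, never treat it as a delimiter.
                current.append(c).append(query.charAt(++i));
            } else if (c == '/') {
                current.append(c);
                insideSlashes = !insideSlashes;
            } else if (Character.isWhitespace(c) && !insideSlashes) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Returns the text between the first '/' and a trailing '/', or null if absent. */
    static String embedded(String token) {
        int start = token.indexOf('/');
        if (start < 0 || !token.endsWith("/") || start == token.length() - 1) {
            return null;
        }
        return token.substring(start + 1, token.length() - 1);
    }

    public static void main(String[] args) {
        // The extension here just compiles a JDK regex; a real one would build a query.
        ParserExtension<Pattern> regexExtension = (field, raw) -> Pattern.compile(raw);
        for (String token : tokenize("title:/ab c*[x]?/ foo")) {
            String raw = embedded(token);
            System.out.println(raw == null ? "plain: " + token
                    : "extension: " + regexExtension.parse("title", raw));
        }
    }
}
```

The parser side never interprets the embedded string; all of that work is left to whatever `ParserExtension` is plugged in.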




[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2094:


Attachment: LUCENE-2094.patch

I updated the patch to use Version in StopFilter. This seems to be reasonable 
though.

> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, 
> LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. As a 
> result, a String / char[] containing Unicode 4 chars which is in the set can not 
> be retrieved in "ignorecase" mode.
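A small sketch of why per-char lowercasing misses Unicode 4 supplementary characters, using only JDK calls. It is not the CharArraySet code from the patch; the class and method names are assumptions for illustration. DESERET CAPITAL LETTER LONG I (U+10400) is used because it is stored as a surrogate pair and has a lowercase form (U+10428).

```java
class CodePointLowerCase {
    /** Lowercases one UTF-16 code unit at a time; surrogate halves have no case mapping. */
    static String lowerByChar(String s) {
        char[] buffer = s.toCharArray();
        for (int i = 0; i < buffer.length; i++) {
            buffer[i] = Character.toLowerCase(buffer[i]);
        }
        return new String(buffer);
    }

    /** Lowercases whole code points, so supplementary characters are handled too. */
    static String lowerByCodePoint(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            sb.appendCodePoint(Character.toLowerCase(codePoint));
            i += Character.charCount(codePoint);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String upper = new String(Character.toChars(0x10400));
        String lower = new String(Character.toChars(0x10428));
        System.out.println(lowerByChar(upper).equals(lower));      // per-char: not lowercased
        System.out.println(lowerByCodePoint(upper).equals(lower)); // per-code-point: lowercased
    }
}
```

A set that normalizes its entries with the per-char loop stores such terms unchanged, so a lookup with the lowercase form fails in "ignorecase" mode.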




[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-11-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783596#action_12783596
 ] 

Michael McCandless commented on LUCENE-1458:


Thanks Uwe!




[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-11-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783598#action_12783598
 ] 

Michael McCandless commented on LUCENE-1458:


{quote}
fwiw here is a patch to use the algorithm from the unicode std for utf8 in 
utf16 sort order.
they claim it is fast because there is no conditional branching... who knows
{quote}
We could try to test to see if we see a difference in practice...

For term text without surrogate content, the branch always goes one way, so the 
CPU ought to predict it well and it may turn out to be faster using branching.

With surrogates, likely the lookup approach is faster since the branch has good 
chance of going either way.

However, the lookup approach adds 256 bytes to the CPU's memory cache, which I'm not 
thrilled about.  We have other places that do the same (NORM_TABLE in 
Similarity, scoreCache in TermScorer), that I think are much more warranted to 
make the time vs cache line tradeoff since they deal with a decent amount of 
CPU.

Or maybe worrying about cache lines from way up in javaland is just silly ;)

I guess at this point I'd lean towards keeping the branch based comparator.
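For readers following along, a branch-based comparator of this kind can be sketched as below. It is a minimal sketch assuming well-formed UTF-8 input, not necessarily the code in the patch; the class name is hypothetical. The fix-up relies on the fact that lead bytes 0xEE and 0xEF (U+E000..U+FFFF) must sort after the 4-byte lead bytes 0xF0..0xF4, since the latter become surrogates (below U+E000) in UTF-16.

```java
import java.nio.charset.StandardCharsets;

class Utf8AsUtf16Order {
    /** Compares two UTF-8 byte arrays so that the result matches UTF-16 code unit order. */
    static int compare(byte[] a, byte[] b) {
        int limit = Math.min(a.length, b.length);
        for (int i = 0; i < limit; i++) {
            int aByte = a[i] & 0xff;
            int bByte = b[i] & 0xff;
            if (aByte != bByte) {
                // Only lead bytes >= 0xEE can be out of order relative to UTF-16.
                if (aByte >= 0xee && bByte >= 0xee) {
                    if ((aByte & 0xfe) == 0xee) aByte += 0x10; // 0xEE/0xEF -> 0xFE/0xFF
                    if ((bByte & 0xfe) == 0xee) bByte += 0x10;
                }
                return aByte - bByte;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        String bmp = "\uE000";                // UTF-8: EE 80 80
        String supplementary = "\uD800\uDC00"; // U+10000, UTF-8: F0 90 80 80
        int utf16 = Integer.signum(supplementary.compareTo(bmp));
        int utf8 = Integer.signum(compare(
                supplementary.getBytes(StandardCharsets.UTF_8),
                bmp.getBytes(StandardCharsets.UTF_8)));
        System.out.println(utf16 == utf8);
    }
}
```

For terms without any byte at or above 0xEE, the inner branch is never taken, which is the well-predicted case described above.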


Re: Socket and file locks

2009-11-30 Thread Michael McCandless
That page looks awesome -- thanks for contributing it!

Mike

On Sun, Nov 29, 2009 at 6:06 PM, Sanne Grinovero
 wrote:
> Hello,
>
> I'm glad you appreciate it; I've added the Wiki page here:
> http://wiki.apache.org/lucene-java/AvailableLockFactories
>
> I avoided on purpose to copy-paste the full javadocs of each
> implementation as that would be out-of-date or too specific to some
> version, I limited myself to writing some words to highlight the
> differences as a quick overview of what is available.
> hope you like it, I'm open to suggestions.
>
> Regards,
> Sanne
>
>
> 2009/11/29 Michael McCandless :
>> This looks great!
>>
>> Maybe it makes most sense to create a wiki page
>> (http://wiki.apache.org/lucene-java) for interesting LockFactory
>> implementations/tradeoffs, and add this there?
>>
>> Mike
>>
>> On Sat, Nov 28, 2009 at 9:26 AM, Sanne Grinovero
>>  wrote:
>>> Hello,
>>> Together with the Infinispan Directory we developed such a
>>> LockFactory; I'd be more than happy if you wanted to add some pointers
>>> to it in the Lucene documentation/readme.
>>> This depends on Infinispan for multiple-machines communication
>>> (JGroups, indirectly) but
>>> it's not required to use an Infinispan Directory, you could combine it
>>> with a Directory impl of choice.
>>> This was tested with the LockVerifyServer mentioned by Michael
>>> McCandless and also
>>> with some other tests inspired from it (in-VM for lower delay
>>> coordination and verify, while the LockFactory was forced to
>>> use real network communication).
>>>
>>> While this is a technology preview and performance regarding the
>>> Directory code is still unknown, I believe the LockFactory was the
>>> most tested component.
>>>
>>> free to download and inspect (LGPL):
>>> http://anonsvn.jboss.org/repos/infinispan/trunk/lucene-directory/
>>>
>>> Regards,
>>> Sanne
>>>
>>> 2009/11/27 Michael McCandless :
>>>> I think a LockFactory for Lucene that implemented the ideas you &
>>>> Marvin are discussing in LUCENE-1877,  and/or the approach you
>>>> implemented in the H2 DB, would be a useful addition to Lucene!
>>>>
>>>> For many apps, the simple LockFactory impls suffice, but for apps
>>>> where multiple machines can become the writer, it gets hairy.  Having
>>>> an always correct Lock impl for these apps would be great.
>>>>
>>>> Note that Lucene has some basic tools (in oal.store) for asserting
>>>> that a LockFactory is correct (see LockVerifyServer), so it's a useful
>>>> way to test that things are working from Lucene's standpoint.
>>>>
>>>> Mike
>>>>
>>>> On Fri, Nov 27, 2009 at 9:23 AM, Thomas Mueller
>>>>  wrote:
>>>>> Hi,
>>>>>
>>>>> I'm wondering if you are interested in automatically releasing the
>>>>> write lock. See also my comments on
>>>>> https://issues.apache.org/jira/browse/LUCENE-1877 - I thought it's a
>>>>> problem worth solving, because it's also in the Lucene FAQ list at
>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ#What_is_the_purpose_of_write.lock_file.2C_when_is_it_used.2C_and_by_which_classes.3F
>>>>>
>>>>> Unfortunately there seems to be no solution that 'always works', but
>>>>> delegating the task and responsibility to the application / to the
>>>>> user is problematic as well. For example, a user of the H2 database
>>>>> (that supports Lucene fulltext indexing) suggested to automatically
>>>>> remove the write.lock file whenever the file is there:
>>>>> http://code.google.com/p/h2database/issues/detail?id=141 - sounds a
>>>>> bit dangerous in my view.
>>>>>
>>>>> So, if you are interested to solve the problem, then maybe I can help.
>>>>> If not, then I will not bother you any longer :-)
>>>>>
>>>>> Regards,
>>>>> Thomas
>>>>>
>>>>>
>>>>>
>>>>>> > > shouldn't active code like that live in the application layer?
>>>>>> > Why?
>>>>>> You can all but guarantee that polling will work at the app layer
>>>>>
>>>>> The application layer may also run with low priority. In operating
>>>>> systems, it's usually the lower layers that have more 'rights'
>>>>> (priority), and not the higher levels (I'm not saying it should be
>>>>> like that in Java). I just think the application layer should not have
>>>>> to deal with write locks or removing write locks.
>>>>>
>>>>>> by the time the original process realizes that it doesn't hold the lock 
>>>>>> anymore, the damage could already have been done.
>>>>>
>>>>> Yes, I'm not sure how to best avoid that (with any design). Asking the
>>>>> application layer or the user whether the lock file can be removed is
>>>>> probably more dangerous than trying the best in Lucene.
>>>>>
>>>>> Standby / hibernate: the question is, if the machine process is
>>>>> currently not running, does the process still hold the lock? I think
>>>>> no, because the machine might as well be turned off. How to detect
>>>>> whether the machine is turned off versus in hibernate mode? I guess
>>>>> that's a problem for all mechanisms (socket / file lock / background
>>>>> thread).
>>>>>
>>>>> When a hibernated pro

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-11-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783654#action_12783654
 ] 

Robert Muir commented on LUCENE-1458:
-

bq. We could try to test to see if we see a difference in practice...

it is also very weird to me that the method you are using is the one being used 
in ICU... if this one is faster why isn't ICU using it?
it's also sketchy that the table as described in the unicode std doesn't even 
work anyway as described... so is anyone using it?

I like your reasoning, let's leave it alone for now... other things to work on 
that will surely help.



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783661#action_12783661
 ] 

Robert Muir commented on LUCENE-1606:
-

Mike,

This is ok for the trunk, but I have a question about \u in flex (I guess 
we do not need to figure it out now, just think about it).
My understanding is that now \u can be in the index, and I can seek to it 
(it won't get replaced with \uFFFD).
From your comment this seems undefined at the moment, but for this enum I need 
to know, otherwise it will either skip \u terms, or go into a loop.



> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.
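The "look at the OK portion, then seek" enumeration described above can be sketched in a simplified form. This is not the patch's implementation and does not use the BRICS package: the DFA is hand-built for the regex ab*c, the term dictionary is a `TreeSet<String>`, and the "next possible String" is simply the smallest string past the rejected prefix. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class DfaTermEnumSketch {
    /** Hand-built DFA for the regex ab*c; returns -1 when no transition exists. */
    static int step(int state, char c) {
        if (state == 0 && c == 'a') return 1;
        if (state == 1 && c == 'b') return 1;
        if (state == 1 && c == 'c') return 2;
        return -1;
    }

    static boolean isAccept(int state) {
        return state == 2;
    }

    /** Returns accepted terms; `visited` records every term actually examined. */
    static List<String> matchingTerms(TreeSet<String> terms, List<String> visited) {
        List<String> matches = new ArrayList<>();
        String seekTo = "";
        while (true) {
            String term = terms.ceiling(seekTo); // "seek" in the sorted term dictionary
            if (term == null) break;
            visited.add(term);
            int state = 0;
            int ok = 0; // length of the prefix that did not hit a reject state
            while (ok < term.length()) {
                int next = step(state, term.charAt(ok));
                if (next < 0) break;
                state = next;
                ok++;
            }
            if (ok == term.length()) {
                if (isAccept(state)) matches.add(term);
                seekTo = term + '\u0000'; // smallest string after this term
            } else {
                char rejected = term.charAt(ok);
                // Every term starting with prefix + rejected char is dead: skip past them.
                seekTo = (rejected == Character.MAX_VALUE)
                        ? term + '\u0000'
                        : term.substring(0, ok) + (char) (rejected + 1);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
                "a", "abbc", "abd", "abda", "abdz", "ac", "acx", "b", "xyz"));
        List<String> visited = new ArrayList<>();
        System.out.println(matchingTerms(terms, visited));
        System.out.println(visited);
    }
}
```

In this run the terms "abda" and "abdz" are never examined, because the seek after rejecting "abd" jumps straight past the dead prefix. A fuller implementation would also use the automaton to pick a seek target the DFA can still accept.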




[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783697#action_12783697
 ] 

Michael McCandless commented on LUCENE-1606:


bq. My understanding is that now \u can be in the index, and I can seek to 
it (it won't get replaced with \uFFFD).

Yes, \u should be untouched now (though I haven't verified  -- actually 
I'll go add it to the test we already have for \u).




[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-11-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783701#action_12783701
 ] 

Robert Muir commented on LUCENE-1606:
-

Thanks Mike, I will change the enum to reflect this.
Currently I cheat and take advantage of this property (in trunk) to make the 
code simpler.




[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

2009-11-30 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783737#action_12783737
 ] 

DM Smith commented on LUCENE-2034:
--

I was trying to lurk, but I'm not able to apply the latest patch against trunk. 
I'm not sure if it's me (using Eclipse) or the patch.

> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2034,patch, LUCENE-2034,patch, LUCENE-2034.patch, 
> LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.txt
>
>
> Due to the various tokenStream APIs we have had in Lucene, analyzer subclasses 
> need to implement at least one of the methods returning a tokenStream. When 
> you look at the code, it appears to be almost identical if both are 
> implemented in the same analyzer. Each analyzer defines the same inner class 
> (SavedStreams), which is unnecessary.
> In contrib, almost every analyzer uses stopwords, and each of them creates its 
> own way of loading them or defines a large number of ctors to load stopwords 
> from a file, set, arrays, etc. Those ctors should be deprecated and 
> eventually removed.




[jira] Commented: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors

2009-11-30 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783741#action_12783741
 ] 

Simon Willnauer commented on LUCENE-2034:
-

bq. I was trying to lurk, but I'm not able to apply the latest patch against 
trunk. I'm not sure if it's me (using Eclipse) or the patch. 
It's most likely the patch. There is so much going on around the analyzers right 
now. We are trying to get LUCENE-2094 in and will get this ready once it is in. I will 
update this patch soon.




[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2094:


Attachment: LUCENE-2094.patch

I hope we made it with this patch - don't want to keep this growing. 
I fixed a problem in CharArraySet (equals / getHashCode) with limits which is 
also the reason why CharacterUtils now has a codePointAt(char[], offset, limit) 
method.
This patch also moves Version into StopFilter but exposes an expert ctor to set 
the posInc manually.

happy reviewing

> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, 
> LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet lowercases its entries if created with the corresponding flag. As 
> a result, String / char[] entries with Unicode 4 chars that are in the set cannot 
> be retrieved in "ignorecase" mode.
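
The problem described above can be reproduced with the JDK alone. The following is a minimal sketch, not the CharArraySet code from the patch; the class and method names are made up for illustration. It contrasts lowercasing one UTF-16 code unit at a time with lowercasing one code point at a time, using Character.codePointAt(char[], int, int) from the standard library.

```java
class CodePointLowercaseSketch {
    /** Lowercases one UTF-16 code unit at a time; surrogate halves pass through unchanged. */
    static char[] lowerByChar(char[] in) {
        char[] out = new char[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = Character.toLowerCase(in[i]);
        }
        return out;
    }

    /** Lowercases one code point at a time, so a surrogate pair is mapped as a unit. */
    static char[] lowerByCodePoint(char[] in) {
        char[] out = new char[in.length];
        int i = 0;
        while (i < in.length) {
            int codePoint = Character.codePointAt(in, i, in.length);
            int lower = Character.toLowerCase(codePoint);
            // Assumption: the lowercase form occupies as many chars as the original,
            // so the output can be written at the same offset.
            i += Character.toChars(lower, out, i);
        }
        return out;
    }
}
```

For example, DESERET CAPITAL LETTER LONG I (U+10400) is stored as a surrogate pair. The per-char variant leaves it untouched, so a lookup that lowercases correctly elsewhere would miss it, while the per-code-point variant maps it to U+10428.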




[jira] Assigned: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler reassigned LUCENE-2094:
-

Assignee: Uwe Schindler

I'll take this one, as discussed in private chat.

> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
>Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, 
> LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet lowercases its entries if created with the corresponding flag. As 
> a result, String / char[] entries with Unicode 4 chars that are in the set cannot 
> be retrieved in "ignorecase" mode.




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783757#action_12783757
 ] 

Uwe Schindler commented on LUCENE-2094:
---

bq. This patch also moves Version into StopFilter but exposes an expert ctor to 
set the posInc manually. 

As discussed before, please deprecate this. The posIncr stuff was deprecated 
everywhere else too (in 2.9 already).




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783760#action_12783760
 ] 

Robert Muir commented on LUCENE-2094:
-

bq. As discussed before, please deprecate this. The posIncr stuff was 
deprecated everywhere else too (in 2.9 already).

I think I disagree, only because Solr StopFilterFactory allows the user to 
explicitly set this.




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783761#action_12783761
 ] 

Simon Willnauer commented on LUCENE-2094:
-

bq. I think I disagree, only because Solr StopFilterFactory allows the user to 
explicitly set this.
+1

bq. As discussed before, please deprecate this. The posIncr stuff was 
deprecated everywhere else too (in 2.9 already).
Except for StopFilter: this class had a non-deprecated posInc constructor. I 
also think this one should be accessible and not deprecated.




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783762#action_12783762
 ] 

Robert Muir commented on LUCENE-2094:
-

fwiw Solr uses .setEnablePositionIncrements method to accomplish this.

So to me it doesn't matter, as there is a way to explicitly do this, I think?




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783765#action_12783765
 ] 

Uwe Schindler commented on LUCENE-2094:
---

...Solr can also use the version to do this.

We also removed the posIncr ctors from a lot of Analyzers (StandardAnalyzer, ...), 
so why not also remove it (deprecate, then remove in 4.0) from StopFilter? There is 
another open issue that says: remove all per-instance setters and make all 
filters final (I think it was the hell issue). All parameters should be passed 
in the ctor, preferably via Version. The alternative is a ctor taking only 
booleans, but most of those were removed in 3.0. The only relic in core is 
StopFilter.

So the ctor taking Version should not expose posIncr, and vice versa. 
If you want to control the flags yourself, create a ctor with posIncr 
and smartjava5unicode switches (ugly).




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783770#action_12783770
 ] 

Robert Muir commented on LUCENE-2094:
-

Uwe, yeah as long as they have some way to do it.

I guess I don't understand whether users view this posInc flag / versioning thing as 
really an option in itself, with Version just providing a better 
default, or whether it's considered a bug that posInc wasn't working before. I think 
there are some tradeoffs in behavior between the two and I'm not sure one size 
fits all.

It's not clear from the Solr issue that added this option either.





[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783772#action_12783772
 ] 

Uwe Schindler commented on LUCENE-2094:
---

It is recommended to turn it on. But if it's off, it must also be disabled in 
the QueryParser. Because of that, QP now also has a Version ctor.

But you are right, Mike preserved the setPositionIncrement method in QP.

At other places like in StandardAnalyzer there is no longer a posIncr setting, 
and that's good!




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783773#action_12783773
 ] 

Uwe Schindler commented on LUCENE-2094:
---

bq. It's not clear from the Solr issue that added this option either.

Because they wanted to preserve backwards compatibility for old indexes, and at that 
time Version was not available. Newer versions of Solr should just add a 
property to their factories giving the version (or a global Solr option 
automatically applied to the whole Solr installation; that's how I do it in 
panFMP, my own Solr-like project).

bq. Or if it's considered a bug that posInc wasn't working before. 

It is a bug.




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783784#action_12783784
 ] 

Yonik Seeley commented on LUCENE-2094:
--

Preserving gaps from stopword removal isn't always desirable... seems it should 
remain an option to enable/disable it.
No biggie if y'all don't agree in Lucene land though - Solr's factory could 
just switch between alternate classes to enable/disable position increments.

Solr's query parser (the one that just extends Lucene's QueryParser) always 
enables position increments.  That allows the true control to rest with the 
filters for specific fields.

bq. It is recommended to turn it on. But if it's off, it must also be disabled 
in the QueryParser.

Why?  What undesirable things happen if the QueryParser has 
enablePositionIncrements(true) with a StopFilter that doesn't produce gaps?
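
To make the "gaps" under discussion concrete, here is a small self-contained sketch. It is not Lucene's StopFilter; the class name, the method, and the "term@position" output format are invented for illustration. It shows how token positions differ when the increments of removed stopwords are kept versus dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class StopGapSketch {
    /**
     * Removes stopwords and reports each surviving token as "term@position".
     * With preserveGaps, a removed stopword still advances the position counter.
     */
    static List<String> filter(List<String> tokens, Set<String> stopWords, boolean preserveGaps) {
        List<String> out = new ArrayList<>();
        int position = -1; // the first emitted token lands on position 0 when nothing precedes it
        int skipped = 0;   // increments accumulated from removed stopwords
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                if (preserveGaps) skipped++;
                continue;
            }
            position += 1 + skipped;
            skipped = 0;
            out.add(token + "@" + position);
        }
        return out;
    }
}
```

For the tokens "quick and the fox" with stopwords "and" and "the", preserving gaps gives quick@0 and fox@3, while dropping them gives quick@0 and fox@1. A phrase query for "quick fox" would therefore only match the second form, which is one reason the setting might reasonably stay configurable per field.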




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783789#action_12783789
 ] 

Uwe Schindler commented on LUCENE-2094:
---

bq. Why? What undesirable things happen if the QueryParser has 
enablePositionIncrements(true) with a StopFilter that doesn't produce gaps?

We coupled it to Version in 2.9. If you create the StopFilter with 
Version.LUCENE_29 it is enabled. If you pass this version to QP, it's enabled, 
too. Very simple?

Solr should make Version a property of all factories and create all 
Filters/Parsers using that flag. That's why we implemented Version (to get rid 
of all these strange boolean flags). Just use Version.valueOf(property) and use 
the result to create your filters. It is now implemented everywhere in Lucene 
Core and Contrib.
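
A rough sketch of the factory pattern suggested here, assuming the version is an enum so that valueOf works on a configuration string. The MatchVersion enum and both methods are hypothetical stand-ins, not Lucene's Version class or Solr's factories; the only behaviour taken from the comment above is that position increments are enabled for 2.9, and treating earlier versions as disabled is an assumption.

```java
class VersionedFactorySketch {
    /** Hypothetical stand-in for a release-version enum, declared in release order. */
    enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

    /** Parses a configured version name; Enum.valueOf throws IllegalArgumentException if unknown. */
    static MatchVersion parse(String property) {
        return MatchVersion.valueOf(property);
    }

    /** Derives the flag from the version instead of taking a separate boolean (assumed cutoff: 2.9). */
    static boolean positionIncrementsEnabled(MatchVersion version) {
        return version.compareTo(MatchVersion.LUCENE_29) >= 0;
    }
}
```

A factory built this way takes a single version property and derives each behavioural default from it, so callers cannot combine flags inconsistently.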




[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783789#action_12783789
 ] 

Uwe Schindler edited comment on LUCENE-2094 at 11/30/09 6:55 PM:
-

bq. Why? What undesirable things happen if the QueryParser has 
enablePositionIncrements(true) with a StopFilter that doesn't produce gaps?

We coupled it to Version in 2.9. If you create the StopFilter with 
Version.LUCENE_29 it is enabled. If you pass this version to QP, it's enabled, 
too. Very simple?

Solr should make Version a property of all factories and create all 
Filters/Parsers using that flag. That's why we implemented Version (to get rid 
of all these strange boolean flags). Just use Version.valueOf(property) and use 
the result to create your filters. It is now implemented everywhere in Lucene 
Core and Contrib (Version.valueOf() would not work in 2.9, because Version 
extends Parameter there, but in 3.0 it's an enum).




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783796#action_12783796
 ] 

Yonik Seeley commented on LUCENE-2094:
--

> > Why? What undesirable things happen if the QueryParser has 
> > enablePositionIncrements(true) with a StopFilter that doesn't produce gaps?

> We coupled it to Version in 2.9. If you create the StopFilter with 
> Version.LUCENE_29 it is enabled. If you pass this version to QP, it's 
> enabled, too. Very simple?

I'm still failing to see why it shouldn't just always be enabled in the query 
parser.  Solr forces it to always be enabled.  Will this cause a bug in any 
scenarios?




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783806#action_12783806
 ] 

Uwe Schindler commented on LUCENE-2094:
---

Yes, it does. If you have an old index without posincr, the query parser would 
produce queries that do not work (we had this issue in 2.9.1 shortly before 
release, one of the reasons why it was delayed).

The version flag is for backwards compatibility. If you do not reindex with a 
new Version constant, you should use the old version constant everywhere and 
things will play happily together. Even Solr users will have old indexes, and 
for them there should be a property to specify the version constant (using the 
valueOf of enums). Solr should then create all components that require a 
version (and since 3.0 *all* analyzers need this) using this property. Then 
everything will work well together (analyzers, query parser and so on).
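
The property-to-Version step described above could be sketched roughly as follows. The `Version` enum here is a small stand-in for `org.apache.lucene.util.Version` (an enum since 3.0) so that the snippet runs without Lucene on the classpath; `VersionProperty` and `parse` are hypothetical names, not Solr or Lucene API.

```java
// Stand-in for org.apache.lucene.util.Version; only a few constants are mirrored.
enum Version { LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_CURRENT }

// Hypothetical helper: not part of Solr or Lucene.
final class VersionProperty {

    // Turns a configuration value such as "LUCENE_24" into a Version constant.
    // Enum.valueOf throws IllegalArgumentException for unknown names.
    static Version parse(String property) {
        if (property == null || property.trim().isEmpty()) {
            throw new IllegalArgumentException("a version property is required");
        }
        return Version.valueOf(property.trim());
    }

    public static void main(String[] args) {
        Version matchVersion = parse("LUCENE_24");
        // The same constant would then be passed to every analyzer, filter
        // and query parser that takes a Version.
        System.out.println("create all components with " + matchVersion);
    }
}
```

The point of parsing once is that a single configured constant is shared by the index-time and query-time components, so they cannot drift apart.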




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783807#action_12783807
 ] 

Simon Willnauer commented on LUCENE-2094:
-

bq. I'm still failing to see why it shouldn't just always be enabled in the 
query parser. Solr forces it to always be enabled. Will this cause a bug in any 
scenarios?

It won't cause any bugs as far as I can see. The root cause of all this is 
compatibility: we try hard to preserve backwards compatibility with Version 
all over the place. The reason for this setter is more or less some kind of 
"expert convenience". My personal feeling would be to make it always true / 
let Version do it.






[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783806#action_12783806
 ] 

Uwe Schindler edited comment on LUCENE-2094 at 11/30/09 7:25 PM:
-

Yes, it does. If you have an old index without posincr, the query parser would 
produce queries that do not work (we had this issue in 2.9.1 shortly before 
release, one of the reasons why it was delayed).

The version flag is for backwards compatibility. If you do not reindex with a 
new Version constant, you should use the old version constant everywhere and 
things will play happily together. Even Solr users will have old indexes, and 
for them there should be a property to specify the version constant (using the 
valueOf of enums). Solr should then create all components that require a 
version (and since 3.0 *all* analyzers need this) using this property. Then 
everything will work well together (analyzers, query parser and so on).

The Highlighter also had a problem with it (the same issue as the QP problem 
before 2.9.1)!




[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2094:


Attachment: LUCENE-2094.patch

Changed the StopFilter(..,posInc,..) ctor to private for convenience.



> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
>Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, 
> LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. As a 
> result, String / char[] entries containing Unicode 4 chars that are in the 
> set cannot be retrieved in "ignorecase" mode.




[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2094:


Attachment: LUCENE-2094.patch

Updated the patch to trunk - Uwe is doing heavy committing.

> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>Reporter: Simon Willnauer
>Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, 
> LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, 
> LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. As a 
> result, String / char[] entries containing Unicode 4 chars that are in the 
> set cannot be retrieved in "ignorecase" mode.




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783829#action_12783829
 ] 

Yonik Seeley commented on LUCENE-2094:
--

bq. Yes it causes. If you have an old index without posincr, the query parser 
would produce queries that do not work

Oh, wait, is this because things like StandardAnalyzer changed the default?  
Seems like that's where the back compat break should have been addressed... 
water under the bridge at this point though.





[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783829#action_12783829
 ] 

Yonik Seeley edited comment on LUCENE-2094 at 11/30/09 8:15 PM:


bq. Yes it causes. If you have an old index without posincr, the query parser 
would produce queries that do not work

Oh, wait, is this because things like StandardAnalyzer changed the default?  
Seems like that's where the back compat break should have been addressed (and 
it was)... water under the bridge at this point though.




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783834#action_12783834
 ] 

Uwe Schindler commented on LUCENE-2094:
---

No, they did not break. If you use Version.LUCENE_24 in the ctor of 
StandardAnalyzer it behaves like in 2.4. That is why we have Version! We 
preserve BW compat by *requiring* a matchVersion parameter in *all* ctors of 
Analyzers.

In 2.9 the deprecated non-version ctors default to version 2.4 (from 3.0 on you 
*have to* specify the version).

If you always use Version.LUCENE_CURRENT then you have to reindex after each 
version upgrade.
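
As a rough sketch of the idea that a matchVersion selects the behavior: the enum below is a stand-in for `org.apache.lucene.util.Version` with an `onOrAfter` comparison, and `PositionIncrementDefaults` is a hypothetical class. The 2.9 cut-off is taken from the statement earlier in this thread that position increments are enabled when the StopFilter is created with Version.LUCENE_29.

```java
// Stand-in for org.apache.lucene.util.Version; declaration order is release order.
enum Version {
    LUCENE_24, LUCENE_29, LUCENE_30;

    // True if this version is the same as or newer than the other one.
    boolean onOrAfter(Version other) {
        return compareTo(other) >= 0;
    }
}

// Hypothetical helper showing a version-gated default.
final class PositionIncrementDefaults {

    // Old match versions keep the 2.4 behavior (no gaps for removed stop
    // words); 2.9 and later enable position increments.
    static boolean enabledByDefault(Version matchVersion) {
        return matchVersion.onOrAfter(Version.LUCENE_29);
    }

    public static void main(String[] args) {
        for (Version v : Version.values()) {
            System.out.println(v + " -> position increments enabled: "
                    + enabledByDefault(v));
        }
    }
}
```

An application that keeps passing the constant it indexed with gets the old behavior at both index and query time, which is the backwards-compatibility guarantee being described.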




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783842#action_12783842
 ] 

DM Smith commented on LUCENE-2094:
--

bq. If you create the StopFilter with Version.LUCENE_29 it is enabled. If you 
pass this version to QP, it's enabled, too. Very simple?

Yes. But, IMHO, this seems like advocating that a desired behavior be gained by 
a backward compatibility mechanism.

I see two problems in using Version to enable position increments (or any other 
particular behavior):
a) If a prior behavior is desired now, one should not need to use a prior 
Version to get it.
b) Version codifies a particular combination of behavior. It does not allow for 
rolling one's own combination.

Make that 3 problems:
c) At some point a prior version's behavior will/should be removed.

It seems like this was discussed at length for creating a Settings object. I'd 
rather see Attribute/AttributeSources used for such a thing than Version.




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783848#action_12783848
 ] 

Michael McCandless commented on LUCENE-2094:


I think if there are indeed valid reasons to have StopFilter throw away the 
holes, then we shouldn't hide this setting behind Version.  I.e., we should 
keep the explicit setters / separate param to ctor.  So I think that's the 
question... is it a bug or a feature?






[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783855#action_12783855
 ] 

Uwe Schindler commented on LUCENE-2094:
---

I will commit the patch now and we can think about undeprecating later. Simon 
wants to go forward with other patches and there are heavy changes in it, so I 
need to do heavy committing.

The discussion should have happened before 2.9, because most individual 
setters are now removed; this is really the only relic. All others are 
subsumed under Version.




[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783861#action_12783861
 ] 

Uwe Schindler edited comment on LUCENE-2094 at 11/30/09 9:52 PM:
-

Committed revision: 885592

I keep this open for further discussion. The Version ctor param is now 
everywhere and it is better than giving a boolean to *every* analyzer that 
uses StopFilter. And that was the reason for creating the Version constants in 
2.9.

bq. So I think that's the question... is it a bug or a feature?

It is a bug.  Everybody should update the code and raise the version constant 
to 31.




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783861#action_12783861
 ] 

Uwe Schindler commented on LUCENE-2094:
---

Committed revision: 885592

I keep this open for further discussion. The Version ctor param is now 
everywhere and it is better than giving a boolean to *every* analyzer that 
uses StopFilter. And that was the reason for creating the Version constants in 
2.9.

bq. So I think that's the question... is it a bug or a feature? Everybody 
should update the code and raise the version constant to 31

It is a bug.




[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2094:
--

Affects Version/s: (was: 3.0.1)
   (was: 2.9.2)
   (was: 2.9.1)
   (was: 3.1)
   (was: 2.4.2)
   (was: 2.4.1)
   (was: 2.3.3)
   (was: 2.3.2)
   (was: 2.3.1)
   (was: 2.9)
   (was: 2.4)
   (was: 2.3)
   (was: 2.2)
   (was: 2.1)
   (was: 2.0.0)
   (was: 1.9)

> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 3.0
>Reporter: Simon Willnauer
>Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, 
> LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, 
> LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. As a 
> result, String / char[] entries containing Unicode 4 chars that are in the 
> set cannot be retrieved in "ignorecase" mode.




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783863#action_12783863
 ] 

Mark Miller commented on LUCENE-2094:
-

bq. It is a bug. 

It was never considered a bug before. It was well known - it's in Lucene in 
Action that you can leave gaps if you'd like to.

bq. Committed revision: 885592

Sucks to rush a commit when an issue is under discussion. Easy to say we can 
come back to this, easy not to. I'm against such heavy committing myself, 
without some consensus to do so. In the old days, there was a bias towards not 
committing.




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783868#action_12783868
 ] 

Uwe Schindler commented on LUCENE-2094:
---

Sorry, the commit is not the problem.

We are discussing only one line: whether we deprecate the explicit boolean arg 
or not. The rest of the patch is not affected. That is why I committed: 
Robert and Simon want to go forward with other analyzer/unicode work.

So this commit does not remove anything. And Version was introduced already in 
2.9.




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783871#action_12783871
 ] 

Robert Muir commented on LUCENE-2094:
-

Hello, my proposal is still the same:
* we use Version to control StopFilter's *default* behavior
* we deprecate the static getDefault... method and the setter
* we add an explicit, even "expert" if you want, ctor that still uses Version, 
but also has this boolean param.

This would mean we do not have to have a boolean in all of our analyzers (it is 
just Version), that there is no setter behavior (I do not like these from a ts 
reusability perspective), and finally that people still get to change to 
non-default behavior for this param if they want.

I'm sorry I haven't been able to keep up with this today (busy), but if there's 
consensus I will create the patch, etc.
I think all we have to do is change one of Simon's ctors from private to public 
and add javadocs.
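
A minimal sketch of the proposal above, not the actual patch. MatchVersion and StopFilterSketch are hypothetical stand-ins for Lucene's Version and StopFilter, and the 2.9 cutoff for the default is an assumption made for illustration. The flag is a final field, so there is no setter; the one-argument ctor takes the default from the version, and the "expert" ctor lets the caller override it.

```java
/** Hypothetical stand-in for Lucene's Version constants; names are illustrative. */
enum MatchVersion {
    V_24, V_29, V_30;

    boolean onOrAfter(MatchVersion other) {
        return compareTo(other) >= 0;
    }
}

/** Sketch of a stop filter whose position-increment behavior is fixed at construction. */
final class StopFilterSketch {
    private final boolean enablePositionIncrements; // final: no setter exists

    /** Normal ctor: the behavior follows the version default. */
    StopFilterSketch(MatchVersion matchVersion) {
        this(matchVersion, versionDefault(matchVersion));
    }

    /** "Expert" ctor: still takes the version, but the flag is explicit. */
    StopFilterSketch(MatchVersion matchVersion, boolean enablePositionIncrements) {
        // matchVersion would drive other version-dependent behavior; unused in this sketch.
        this.enablePositionIncrements = enablePositionIncrements;
    }

    /** Assumed rule for this sketch: holes are kept from 2.9 onward. */
    static boolean versionDefault(MatchVersion matchVersion) {
        return matchVersion.onOrAfter(MatchVersion.V_29);
    }

    boolean getEnablePositionIncrements() {
        return enablePositionIncrements;
    }
}
```

An analyzer would then only pass its Version through, and a user who wants non-default behavior would call the two-argument ctor directly.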




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783872#action_12783872
 ] 

Mark Miller commented on LUCENE-2094:
-

bq. We are discussing only about one line, if we deprecate the explicit boolean 
arg or not. 

One line that's part of this patch. By committing, you remove the incentive to 
deal with the issue, as the patch works in line with this being a bug. Now it's 
in the code, now everyone can go home and forget.

bq. because Robert and Simon want to go forward with other analyzer/unicode 
work.

What's the rush? They can do other work without this being in trunk today. 
That's not a valid reason for any commit in my mind.

bq. So this commit does not remove anything. And Version was introduced already 
in 2.9.

I don't think it matters - where is the consensus to do this commit now after 
discussion around it (one line or not) started? I don't see it.




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783878#action_12783878
 ] 

Mark Miller commented on LUCENE-2094:
-

IMHO the most that should have happened is: you propose that the current 
discussion is not pertinent to committing this patch. You then ask what others 
think about committing and keeping the issue open. You then say that, if no one 
objects, you will commit in a day or two. I'm against quick commits like this.




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783882#action_12783882
 ] 

Uwe Schindler commented on LUCENE-2094:
---

The term "heavy committing" is just a joke; I did not do any real uncontrolled 
heavy committing today. I just called it that because the patch was very large 
and affected lots of files. We are sorry; Robert, Simon and I were chatting 
privately a lot in parallel and came to the opinion that we should commit this 
first and then discuss this one ctor further. Discussing in this JIRA issue 
is a pain because of the long page loading time.

The addition of Version to StopFilter was already agreed; the only open point 
was the deprecation of the boolean flag. Let's open another issue for it and 
solve it separately. We should have opened another issue for it already, but we 
merged/developed both patches (adding matchVersion to CharArraySet and 
StopFilter) together, because the same files were always affected and that way 
fewer patches could get out of sync.

So I am sorry for lots of commits today!




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783895#action_12783895
 ] 

Uwe Schindler commented on LUCENE-2094:
---

bq. I'm against quick commits like this. 

Revert?




[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2094:


Attachment: LUCENE-2094.patch

attached is my proposal mentioned in the comments above.

> Prepare CharArraySet for Unicode 4.0
> 
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 3.0
> Reporter: Simon Willnauer
>Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, 
> LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, 
> LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. As a 
> result, String / char[] entries with Unicode 4 chars which are in the set 
> cannot be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783901#action_12783901
 ] 

Uwe Schindler commented on LUCENE-2094:
---

That's much easier to discuss; everybody can see in one small patch what's 
happening. The other one was too big and unrelated. A new issue would have 
been better altogether.




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783904#action_12783904
 ] 

Michael McCandless commented on LUCENE-2094:


bq.  Robert, Simon and me were chatting private a lot in parallel and came to 
the opinion, that we should commit this first and then discuss about this one 
ctor more.

Really discussions like this should happen in public.

bq. Discussing in this JIRA issue is a pain because of long page loading time.

We can carry it over to java-dev, in general.  I agree page load time gets 
annoying for big issues...




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783906#action_12783906
 ] 

Robert Muir commented on LUCENE-2094:
-

bq. Really discussions like this should happen in public.

Actually, what I asked Uwe was whether he could take this issue for me, since I 
will be busy at work this week and it's holding Simon up.





[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783913#action_12783913
 ] 

Michael McCandless commented on LUCENE-2094:


I personally don't like that StopFilter can discard the holes.  It
loses information, that you can never get back, once indexed.

That said, it's clearly not black & white -- enough people feel it's a
feature (not a bug), and should be preserved, so I think we should
preserve it as a standalone option.

But I think we should keep the default as "don't discard the holes".

{quote}
Hello, my proposal is still the same:
  * we use Version to control StopFilter's default behavior
  * we deprecate the static getDefault... method and the setter
  * we add an explicit, even "expert" if you want, ctor that still uses 
Version, but also has this boolean param.
{quote}

I think this is a good approach!
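
To make the "holes" concrete, here is a minimal sketch of how positions come out with and without them. HolesSketch is hypothetical and does not use Lucene; it assumes the usual convention that the first token sits at position 0 and that each removed stop word adds one to the next token's position increment when holes are kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class HolesSketch {
    /** Returns "term@position" for each token that survives stop word removal. */
    static List<String> positions(List<String> tokens, Set<String> stopWords, boolean keepHoles) {
        List<String> out = new ArrayList<String>();
        int position = -1;
        int pendingIncrement = 1;
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                if (keepHoles) {
                    pendingIncrement++; // remember the gap left by the removed word
                }
                continue;
            }
            position += pendingIncrement;
            pendingIncrement = 1;
            out.add(token + "@" + position);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("book", "for", "sale");
        Set<String> stop = new HashSet<String>(Arrays.asList("for"));
        System.out.println(positions(tokens, stop, true));  // [book@0, sale@2]
        System.out.println(positions(tokens, stop, false)); // [book@0, sale@1]
    }
}
```

Once indexed without holes, "book@0, sale@1" no longer says whether a word was removed in between, which is the information loss described above.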





[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783915#action_12783915
 ] 

Uwe Schindler commented on LUCENE-2094:
---

And Simon wanted to also work on the "massive code duplication" (LUCENE-2034)  
issue which would break this patch and vice versa. You never get these two 
patches to merge, because the code duplication issue does large refactoring of 
almost all analyzers. And xmas is coming, so we want to have a nice xmas 
present for all analyzer writers...




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783917#action_12783917
 ] 

Robert Muir commented on LUCENE-2094:
-

Mike, in my opinion the holes can have some impact on phrase queries.
Personally I think the situation is complex (and, I hate to say, language 
dependent), but I think "holes" are a good default.

But I should be able to change them explicitly, overriding the default.
Version should give us the ability to change defaults (while still 
providing options), not just fix bugs.

If anyone has time to glance at the patch, let me know what you think. We don't 
have to deprecate the setter, that's just me being anal.





[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783920#action_12783920
 ] 

Uwe Schindler commented on LUCENE-2094:
---

I like the patch if it solves this issue and we are all happy again. The 
updates to javadocs are also fine, the 2.9 thing was missing.

The problem in this issue was that some participants were not fully informed 
about the Version parameter at all, or that it prevents breaking backwards 
compatibility. My opinion is: Please also add matchVersion as a factory property 
for query parser and analyzers! Solr would profit from it, too. There would be 
fewer options, and you could preserve your config file even after a major Solr 
update without breaking any existing indexes. That is the lesson from this issue.

Whether to have a separate get/set for this posIncr stuff is a discussion for a 
separate issue.
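
The idea of matchVersion as a factory property could look roughly like the sketch below. Everything in it is hypothetical: the class, the "matchVersion" key and the "2.9"-style value format are assumptions for illustration, not an existing Solr or Lucene API. The point is only that one configured value selects all version-dependent defaults, so the config keeps its meaning across an upgrade.

```java
import java.util.HashMap;
import java.util.Map;

final class MatchVersionConfigSketch {
    /** Hypothetical stand-in for Lucene's Version constants. */
    enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

    /** Reads the assumed "matchVersion" property, e.g. "2.9" becomes LUCENE_29. */
    static MatchVersion parse(Map<String, String> args) {
        String raw = args.get("matchVersion");
        if (raw == null) {
            throw new IllegalArgumentException("matchVersion is required");
        }
        // valueOf throws IllegalArgumentException for unknown versions.
        return MatchVersion.valueOf("LUCENE_" + raw.trim().replace(".", ""));
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<String, String>();
        config.put("matchVersion", "2.9");
        System.out.println(parse(config)); // LUCENE_29
    }
}
```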




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783921#action_12783921
 ] 

Michael McCandless commented on LUCENE-2094:


bq. The Version should allow us having the capability to change defaults (while 
still providing options), not just fix bugs

Right, Version allows us to change defaults and fix bugs w/o breaking back 
compat.

The patch looks good to me, and I think deprecating the setter makes sense -- 
being able to specify this on ctor is enough.




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783922#action_12783922
 ] 

Michael McCandless commented on LUCENE-2094:


bq. Mike, in my opinion the holes can have some impact on phrase queries.

But if the PhraseQuery is generated with QueryParser also preserving holes, 
then it works properly?




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783924#action_12783924
 ] 

Robert Muir commented on LUCENE-2094:
-

bq. But if the PhraseQuery is generated with QueryParser also preserving holes, 
then it works properly?

What is "properly"?

If I search in English for "book for sale", it will match "books for sale".
This is considered OK for English.

If I am using the Persian analyzer, such a thing will not work, because the 
plural form of book (کتاب) is formed by adding an additional word afterwards 
(کتاب ها).

So the way plural forms get "stemmed" to their singular form in Persian is 
implemented with stopwords (ها is in the list). I think this is a clean, simple 
approach, which is why I did it this way.

For English, it's attached to the word with an s... should we bump the posinc 
gap after stemmed words in English too?

So you see, I think it's dependent upon language and how you want the 
application to work.
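
A minimal sketch of the Persian case as I read it, assuming exact phrase matching on positions. PhraseHolesSketch is hypothetical and does not use Lucene; BOOK and PLURAL are the two tokens from the comment, and OTHER is a placeholder for whatever word follows. For simplicity each term is assumed to occur at most once.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PhraseHolesSketch {
    static final String BOOK = "کتاب";  // "book", from the comment above
    static final String PLURAL = "ها";  // plural marker, treated as a stop word
    static final String OTHER = "x";    // placeholder for any following word

    /** Maps each surviving term to its position (assumes each term occurs once). */
    static Map<String, Integer> positions(List<String> tokens, Set<String> stop, boolean keepHoles) {
        Map<String, Integer> out = new LinkedHashMap<String, Integer>();
        int position = -1;
        int pendingIncrement = 1;
        for (String token : tokens) {
            if (stop.contains(token)) {
                if (keepHoles) {
                    pendingIncrement++;
                }
                continue;
            }
            position += pendingIncrement;
            pendingIncrement = 1;
            out.put(token, position);
        }
        return out;
    }

    /** Exact phrase match: every query term must appear at the same relative offset. */
    static boolean phraseMatches(Map<String, Integer> doc, Map<String, Integer> query) {
        Integer offset = null;
        for (Map.Entry<String, Integer> e : query.entrySet()) {
            Integer docPos = doc.get(e.getKey());
            if (docPos == null) {
                return false;
            }
            int delta = docPos - e.getValue();
            if (offset == null) {
                offset = delta;
            } else if (offset != delta) {
                return false;
            }
        }
        return true;
    }

    /** Does a singular phrase query match a document that contains the plural form? */
    static boolean singularMatchesPlural(boolean keepHoles) {
        Set<String> stop = Collections.singleton(PLURAL);
        Map<String, Integer> doc = positions(Arrays.asList(BOOK, PLURAL, OTHER), stop, keepHoles);
        Map<String, Integer> query = positions(Arrays.asList(BOOK, OTHER), stop, keepHoles);
        return phraseMatches(doc, query);
    }
}
```

With holes kept, the document has OTHER two positions after BOOK while the query expects it one position after, so singularMatchesPlural(true) is false; with holes discarded it is true.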





[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783926#action_12783926
 ] 

Uwe Schindler commented on LUCENE-2094:
---

bq. But if the PhraseQuery is generated with QueryParser also preserving holes, 
then it works properly?

Yes, I tested this before 2.9.1 (one reason why you had to respin).

QueryParser also still has the get/set for posIncr but also takes the 
matchVersion. Here it is the other way round, the ctor uses the default with 
Version and you can change it by a setter later (which is still not deprecated 
and available in 3.0).




[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783926#action_12783926
 ] 

Uwe Schindler edited comment on LUCENE-2094 at 11/30/09 11:08 PM:
--

bq. But if the PhraseQuery is generated with QueryParser also preserving holes, 
then it works properly?

Yes, I tested this before 2.9.1 (one reason why you had to respin).

QueryParser also still has the get/set for posIncr but also takes the 
matchVersion. Here it is the other way round, the ctor uses the default with 
Version and you can change it by a setter later (which is still not deprecated 
and available in 3.0).

In my opinion we should go that way (which is against Robert's opinion). The 
ctor taking two booleans is very bad...




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783930#action_12783930
 ] 

Uwe Schindler commented on LUCENE-2094:
---

bq. So the way plural forms get "stemmed" to their singular form in persian is 
implemented with stopwords (ها is in the list). I think this is a clean simple 
approach, which is why I did it this way.

But if this is so, you should have initialized the stop filter in the Persian 
analyzer with a fixed "false". But it also used 
StopFilter.getEnablePositionIncrementsVersionDefault() and used the version 
default. Should we fix this?




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783932#action_12783932
 ] 

Robert Muir commented on LUCENE-2094:
-

bq. But if this is so, you should have initialized the stop filter in persian 
analyzer with a fixed "false". But it also used 
StopFilter.getEnablePositionIncrementsVersionDefault() and used the version 
default. Should we fix this?

I don't think so. I think it's up to the user to decide how they want the search 
to work, even in this example.
If they don't like the defaults for how PhraseQuery works, they can create an 
analyzer that uses the StopFilter differently.

I don't think the issue is clear for any given language; I think it always 
depends on how your application works.
I mean, we add a hole for "the" in English, but in Bulgarian (LUCENE-2062) this 
is a suffix attached to the end of a noun.
With Arabic it's always a prefix. I don't think we need to have options to add a 
posInc gap if we stem the leading ال off an Arabic word.

I'm just trying to show some examples of why a user might want to change the 
defaults.
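
For readers unfamiliar with the "hole" being discussed, here is a toy model of stop-word removal with and without position increments. It is a sketch only: ToyStopDemo and its filter method are hypothetical names, not Lucene's StopFilter, and positions are simply counted from zero.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of stop-word removal with position increments.
// Not Lucene's StopFilter; all names here are hypothetical.
class ToyStopDemo {

    /** Returns "term@position" for each token that survives stop-word removal. */
    static List<String> filter(List<String> tokens, Set<String> stopWords, boolean keepHoles) {
        List<String> out = new ArrayList<String>();
        int position = -1;
        int skipped = 0; // stop words dropped since the last kept token
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                skipped++;
                continue;
            }
            // With holes, the kept token jumps over the removed positions.
            position += keepHoles ? 1 + skipped : 1;
            skipped = 0;
            out.add(token + "@" + position);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("over", "the", "fence");
        Set<String> stop = new HashSet<String>(Arrays.asList("the"));
        System.out.println(filter(tokens, stop, true));  // [over@0, fence@2]
        System.out.println(filter(tokens, stop, false)); // [over@0, fence@1]
    }
}
```

With holes, an exact phrase query for "over fence" would no longer see the two terms as adjacent; without holes it would. Which behaviour is wanted is exactly the per-language, per-application question raised above.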





[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783934#action_12783934
 ] 

Uwe Schindler commented on LUCENE-2094:
---

A good idea might be to use two StopFilters:
- One with the real stop words that uses the Version-default setting for posIncr
- One for the plural suffixes and so on, that should simply be removed. This 
StopFilter would use false for posIncr.
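
The two-filter idea above can be sketched with a toy token model. Nothing here is Lucene API: Tok, removeAll and the placeholder terms ("ha" standing in for a plural suffix token) are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: Tok and removeAll are hypothetical, not Lucene classes.
class TwoStageStopDemo {

    /** A term plus its position increment relative to the previous token. */
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    /** Drops tokens in remove; if keepHoles, their increments pass to the next kept token. */
    static List<Tok> removeAll(List<Tok> in, Set<String> remove, boolean keepHoles) {
        List<Tok> out = new ArrayList<Tok>();
        int pending = 0; // increments of removed tokens waiting for the next kept token
        for (Tok t : in) {
            if (remove.contains(t.term)) {
                if (keepHoles) pending += t.posInc;
                continue;
            }
            out.add(new Tok(t.term, t.posInc + pending));
            pending = 0;
        }
        return out;
    }

    /** Renders absolute positions as "term@position". */
    static List<String> positions(List<Tok> toks) {
        List<String> out = new ArrayList<String>();
        int pos = -1;
        for (Tok t : toks) {
            pos += t.posInc;
            out.add(t.term + "@" + pos);
        }
        return out;
    }

    /** Stage 1 removes real stop words with holes; stage 2 removes suffix tokens without. */
    static List<String> demo() {
        List<Tok> in = Arrays.asList(new Tok("the", 1), new Tok("book", 1),
                new Tok("ha", 1), new Tok("is", 1), new Tok("old", 1));
        Set<String> stopWords = new HashSet<String>(Arrays.asList("the", "is"));
        Set<String> suffixes = new HashSet<String>(Arrays.asList("ha"));
        List<Tok> stage1 = removeAll(in, stopWords, true);
        List<Tok> stage2 = removeAll(stage1, suffixes, false);
        return positions(stage2);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [book@1, old@3]
    }
}
```

In this sketch the hole left by "is" survives while the suffix token leaves no gap. It assumes each term belongs to exactly one of the two sets.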




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783938#action_12783938
 ] 

Robert Muir commented on LUCENE-2094:
-

bq. A good idea might be to use two StopFilters: 

In theory, but sometimes these terms are ambiguous, and the computer 
(especially a very simple analyzer) does not know which one it is; sometimes it 
can be both. 

Sometimes it's a real word too, but on average it's better to ignore it.

I don't think we need to go to this effort for optimal phrase queries either. A 
user who really cares can do this themselves... and that's my whole point: they 
should be able to do something like what you said, and explicitly say 'no, I 
don't want posIncr for this StopFilter, but yes, I'll take the real bugfixes, 
thanks'.




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783954#action_12783954
 ] 

Michael McCandless commented on LUCENE-2094:


bq. So you see, I think it's dependent upon language and how you want the 
application to work.

OK, indeed, the issue is not simple -- thanks for the examples!




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783959#action_12783959
 ] 

Uwe Schindler commented on LUCENE-2094:
---

Before I go to bed, how about ctor vs. get/set:
QueryParser currently only has a ctor taking matchVersion, which sets the 
default. If somebody wants to change the default, they can later call 
setEnablePositionIncrements().
In my opinion, this is clearer than supplying both in one ctor (they are two 
params that seem to interact with each other, but they don't!). I would also 
prefer to initialize StopFilter with the defaults in the ctor, and later change 
it using setters.
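
The ctor-plus-setter shape described above might look roughly like this. It is a sketch with made-up types: ToyVersion and ToyFilter are not Lucene classes, and the rule that the newer version defaults to true is an assumption of this example.

```java
// Sketch of "version picks the default, setter overrides it".
// ToyVersion and ToyFilter are hypothetical; they are not Lucene classes.
class VersionedFilterDemo {

    /** Hypothetical stand-in for a compatibility version constant. */
    enum ToyVersion { V_2_4, V_2_9 }

    static final class ToyFilter {
        private boolean enablePositionIncrements;

        /** The ctor takes only the version and derives the default from it. */
        ToyFilter(ToyVersion matchVersion) {
            // Assumption for this sketch: V_2_9 and later default to true.
            this.enablePositionIncrements = matchVersion.compareTo(ToyVersion.V_2_9) >= 0;
        }

        /** An explicit override, independent of the version passed to the ctor. */
        void setEnablePositionIncrements(boolean enable) {
            this.enablePositionIncrements = enable;
        }

        boolean getEnablePositionIncrements() {
            return enablePositionIncrements;
        }
    }

    public static void main(String[] args) {
        ToyFilter filter = new ToyFilter(ToyVersion.V_2_9);
        System.out.println(filter.getEnablePositionIncrements()); // true
        filter.setEnablePositionIncrements(false);
        System.out.println(filter.getEnablePositionIncrements()); // false
    }
}
```

Keeping the two concerns apart means the version argument never has to be read together with a boolean in the same signature.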




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783963#action_12783963
 ] 

Robert Muir commented on LUCENE-2094:
-

bq. In my opinion, this is more clear than supplying both in one ctor (they are 
two params that seem to interact with each other, but they don't!). I would 
also prefer to initialize StopFilter with the defaults in the ctor, and later 
change it using setters.

If this is better, then we need not do anything (except I still think we should 
fix up some minor unrelated javadocs problems I had in the patch). The setter 
is not deprecated currently.




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783967#action_12783967
 ] 

Uwe Schindler commented on LUCENE-2094:
---

Oh, you are right! All posIncr ctors are deprecated, the matchVersion ones 
bringing defaults are the new ones. And you can change this default later - 
perfect. Just more documentation! :-)

+1 from my side.




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783974#action_12783974
 ] 

Robert Muir commented on LUCENE-2094:
-

Uwe, the problem I think is still what DM/Mike said before:

{quote}
I think if indeed there are valid reasons to have StopFilter throw away the 
holes, then, we shouldn't hide this setting behind Version. Ie, we should keep 
the explicit setters / separate param to ctor. So I think that's the 
question... is it a bug or a feature?
{quote}






[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784023#action_12784023
 ] 

Mark Miller commented on LUCENE-2094:
-

bq. Revert?

Not from me. I don't think it's a huge deal, certainly not something that 
requires a revert. I just worry sometimes about the pace of things - and that 
the more active one becomes, the more they/we should worry about allowing 
proper time for comments/objections when committing. I like how people have 
tended to err on the side of waiting for solid consensus myself. It's not a big 
issue here - but as we move away from that I think it will be. It's something 
that will spread as new users look at old users when determining how they act.

The more heavy committing one does, the easier I think it is to just decide 
stuff and cram it in - personally (and I'm just one voice).  The more you do, I 
think it's also more important to allow brief time periods between saying what 
you are going to do and doing it (though that should always be done). It's easy 
to say, well we can just change it, or pull it out - but with lazy consensus 
and how the community works, I think that's conducive to worse code. It's much 
easier for someone to debate and have questions than it is to hound changes or 
code out of trunk. In my mind it's better if the bottleneck is on the going in, 
as it has been, rather than shifting things to fixing what's in. Especially if 
there is debate in an issue still - whether it belongs there or not - I think 
there should be warning and consensus before a commit.

I realize that's a bit of a tough sell based on this little issue alone - but 
it's a general feeling I've been having as Lucene dev has really been ramping up 
in recent times. I think it's important we stick to being conservative about 
waiting for consensus - giving others a chance to voice their opinion - no 
matter how sure you are about your decision. I think it's an important example 
for new users, and an important characteristic of Lucene development.

That's just me though - I don't speak for anyone but myself.




[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

2009-11-30 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784025#action_12784025
 ] 

Mark Miller commented on LUCENE-2094:
-

bq. The term "heavy committing" is just a joke

I know, I wasn't using it very seriously myself ;)

bq. So I am sorry for lots of commits today!

No worries - I don't mean to frame anything in a way that you should have to 
apologize for. Lots of commits are still good from my point of view! I just 
think there should be something of a warning before a commit in an issue that 
is being actively discussed. 




Build failed in Hudson: Lucene-trunk #1014

2009-11-30 Thread Apache Hudson Server
See 

Changes:

[uschindler] LUCENE-2094: Prepare CharArraySet for Unicode 4.0

[uschindler] LUCENE-1844: Fix problem on Windows, because the path separator \ 
is also an escape for properties files. Simple fix is to replace all backslashes 
by forward slashes in the getReuters20File.
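
The escape problem named in the LUCENE-1844 entry above can be reproduced with java.util.Properties alone. This is a minimal sketch, not the committed fix: the class, the "file" key and the example path are all made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Demonstrates that a backslash in a properties value is read as an escape.
// Class name, key and path are hypothetical; this is not the LUCENE-1844 patch.
class PropertiesBackslashDemo {

    /** Parses one "file=..." line the way a properties file would be read. */
    static String loadValue(String rawLine) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(rawLine));
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory reader
        }
        return props.getProperty("file");
    }

    /** The workaround described above: write the path with forward slashes. */
    static String toForwardSlashes(String path) {
        return path.replace('\\', '/');
    }

    public static void main(String[] args) {
        String windowsPath = "C:\\work\\docs"; // the text C:\work\docs
        System.out.println(loadValue("file=" + windowsPath));                   // C:workdocs
        System.out.println(loadValue("file=" + toForwardSlashes(windowsPath))); // C:/work/docs
    }
}
```

The separators are silently dropped in the first case because an unrecognized escape such as \w yields just the following character.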

[uschindler] fix javadoc

[uschindler] fix javadocs

--
[...truncated 2471 lines...]

jflex-uptodate-check:

jflex-notice:

init:

init-dist:
[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 


package-tgz-src:
  [tar] Building tar: 

 [gzip] Building: 


BUILD SUCCESSFUL
Total time: 28 seconds
+ cp dist/lucene-2009-12-01_02-03-50-src.tar.gz 

+ /export/home/hudson/tools/ant/latest/bin/ant -lib 
/export/home/nigel/hudsonSupport/maven 
-Dsvnversion.exe=/opt/subversion-current/bin/svnversion 
-Dsvn.exe=/opt/subversion-current/bin/svn -Dversion=3.1-SNAPSHOT 
generate-maven-artifacts
Buildfile: build.xml

maven.ant.tasks-check:

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

init:

clover.setup:

clover.info:

clover:

compile-core:
[mkdir] Created dir: 

[javac] Compiling 393 source files to 

[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.

jar-core:
  [jar] Building jar: 


build-contrib:

common:
 [echo] Building analyzers...

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

common.init:

build-lucene:

build-lucene-tests:

javacc-uptodate-check:

javacc-notice:

jflex-uptodate-check:

jflex-notice:

init:

clover.setup:

clover.info:

clover:

compile-core:

compile-test:
[mkdir] Created dir: 

[javac] Compiling 209 source files to 

[javac] An exception has occurred in the compiler (1.5.0_14). Please file a 
bug at the Java Developer Connection (http://java.sun.com/webapps/bugreport)  
after checking the Bug Parade for duplicates. Include your program and the 
following diagnostic in your report.  Thank you.
[javac] java.lang.AssertionError: {unused}
[javac] at 
com.sun.tools.javac.tree.TreeMaker$AnnotationBuilder.visitArray(TreeMaker.java:634)
[javac] at 
com.sun.tools.javac.code.Attribute$Array.accept(Attribute.java:124)
[javac] at 
com.sun.tools.javac.tree.TreeMaker$AnnotationBuilder.translate(TreeMaker.java:637)
[javac] at 
com.sun.tools.javac.tree.TreeMaker$AnnotationBuilder.visitCompoundInternal(TreeMaker.java:628)
[javac] at 
com.sun.tools.javac.tree.TreeMaker$AnnotationBuilder.translate(TreeMaker.java:641)
[javac] at 
com.sun.tools.javac.tree.TreeMaker.Annotation(TreeMaker.java:649)
[javac] at 
com.sun.tools.javac.tree.TreeMaker.Annotations(TreeMaker.java:570)
[javac] at com.sun.tools.javac.tree.TreeMaker.VarDef(TreeMaker.java:554)
[javac] at 
com.sun.tools.javac.comp.Lower.visitIterableForeachLoop(Lower.java:2892)
[javac] at 
com.sun.tools.javac.comp.Lower.visitForeachLoop(Lower.java:2755)
[javac] at 
com.sun.tools.javac.tree.Tree$ForeachLoop.accept(Tree.java:597)
[javac] at com.sun.tools.javac.comp.Lower.translate(Lower.java:1881)
[javac] at 
com.sun.tools.javac.tree.TreeTranslator.translate(TreeTranslator.java:54)
[javac] at 
com.sun.tools.javac.tree.TreeTranslator.visitBlock(TreeTranslator.java:145)
[javac] at com.sun.tools.javac.comp.Lower.visitBlock(Lower.java:2927)
[javac] at com.sun.tools.javac.tree.Tree$Block.accept(Tree.java:535)
[javac] at com.sun.tools.javac.comp.Lower.translate(Lower.java:1881)
[javac] at 
com.sun.tools.javac.comp.Lower.visitWhileLoop(Lower.java:2939)
[javac] at com.sun.tools.javac.tree.Tree$WhileLoop.accept(Tree.java:563)
[javac] at com.sun.tools.javac.comp.Lower.translate(Lower.java:1881)
[javac] at 
com.sun.tools.javac.tree.TreeTranslator.translate(TreeTranslator.java:54)
[javac

[jira] Created: (LUCENE-2098) make BaseCharFilter more efficient in performance

2009-11-30 Thread Koji Sekiguchi (JIRA)
make BaseCharFilter more efficient in performance
-

Key: LUCENE-2098
URL: https://issues.apache.org/jira/browse/LUCENE-2098
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Affects Versions: 2.9
Reporter: Koji Sekiguchi
Priority: Minor


Performance degradation in Solr 1.4 was reported. See:

http://www.lucidimagination.com/search/document/43c4bdaf5c9ec98d/html_stripping_slower_in_solr_1_4

The inefficiency has been pointed out in BaseCharFilter javadoc by Mike:

{panel}
NOTE: This class is not particularly efficient. For example, a new class 
instance is created for every call to addOffCorrectMap(int, int), which is then 
appended to a private list. 
{panel}
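
One possible direction, sketched under assumptions: instead of allocating an object per correction and appending it to a list, keep the offsets and cumulative diffs in two parallel int arrays and look them up by binary search. The class below is hypothetical; it is not BaseCharFilter and not a patch for this issue, and it assumes corrections are added in increasing offset order.

```java
import java.util.Arrays;

// Hypothetical sketch: parallel int arrays instead of one object per correction.
// Not BaseCharFilter's code; assumes add() is called with increasing offsets.
class OffsetCorrectionSketch {
    private int[] offsets = new int[8];
    private int[] diffs = new int[8];
    private int size = 0;

    /** Records that offsets at or after off are shifted by cumulativeDiff. */
    void add(int off, int cumulativeDiff) {
        if (size == offsets.length) {
            // Grow both arrays together; no per-entry allocation.
            offsets = Arrays.copyOf(offsets, size * 2);
            diffs = Arrays.copyOf(diffs, size * 2);
        }
        offsets[size] = off;
        diffs[size] = cumulativeDiff;
        size++;
    }

    /** Applies the diff of the last recorded entry whose offset is <= currentOff. */
    int correct(int currentOff) {
        int idx = Arrays.binarySearch(offsets, 0, size, currentOff);
        if (idx < 0) {
            idx = -idx - 2; // insertion point minus one
        }
        return idx < 0 ? currentOff : currentOff + diffs[idx];
    }

    public static void main(String[] args) {
        OffsetCorrectionSketch map = new OffsetCorrectionSketch();
        map.add(5, 3);
        map.add(10, 7);
        System.out.println(map.correct(2));  // 2
        System.out.println(map.correct(7));  // 10
        System.out.println(map.correct(12)); // 19
    }
}
```

Whether this actually helps the Solr 1.4 case reported above would need measuring.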





[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

2009-11-30 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1458:
--

Attachment: LUCENE-1458-NRQ.patch

To prevent problems like yesterday, here is the patch I applied yesterday to the 
flex branch (for completeness).

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-NRQ.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458_rotate.patch, 
> LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, 
> UnicodeTestCase.patch, UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.
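
The description's point about encoding offsets absolutely at seek points, rather than as deltas, can be illustrated with a toy encoder. This is only a sketch of the general technique: the class, the interval parameter and the long[] layout are assumptions of this example and say nothing about the real tis/tii file format.

```java
// Toy sketch of delta encoding with absolute values at seek points.
// Names and layout are hypothetical; this is not the tis/tii format.
class SeekPointEncodingSketch {

    /** Stores every interval-th offset absolutely and the rest as deltas. */
    static long[] encode(long[] offsets, int interval) {
        long[] out = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            out[i] = (i % interval == 0) ? offsets[i] : offsets[i] - offsets[i - 1];
        }
        return out;
    }

    /** Decodes one entry, starting from the nearest preceding seek point. */
    static long decodeAt(long[] encoded, int interval, int index) {
        int seekPoint = index - (index % interval);
        long value = encoded[seekPoint]; // absolute, so no earlier entries are needed
        for (int i = seekPoint + 1; i <= index; i++) {
            value += encoded[i];
        }
        return value;
    }

    public static void main(String[] args) {
        long[] offsets = {100, 130, 145, 200, 260, 300};
        long[] encoded = encode(offsets, 3); // {100, 30, 15, 200, 60, 40}
        System.out.println(decodeAt(encoded, 3, 4)); // 260
    }
}
```

Because each seek point is self-contained, a reader can jump to one and decode forward without replaying the deltas that came before it.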
