Re: Build failed in Hudson: Lucene-trunk #1064

2010-01-16 Thread Michael McCandless
The Berkeley DB jars failed to download again... this strikes us every so often:

get-je-jar:
   [mkdir] Created dir:

 [get] Getting: http://download.oracle.com/berkeley-db/je-3.3.69.zip
 [get] To: 

 [get] Error getting
http://download.oracle.com/berkeley-db/je-3.3.69.zip to




We have this issue open to figure out how to more reliably host it:

http://issues.apache.org/jira/browse/LUCENE-1845

Mike

On Fri, Jan 15, 2010 at 9:27 PM, Apache Hudson Server wrote:
> See 
>
> Changes:
>
> [uschindler] move changes.txt entry into contrib
>
> [uschindler] LUCENE-2211: Fix various missing clearAttributes() and improve 
> BaseTokenStreamTestCase to check for this trap
>
> --
> [...truncated 3921 lines...]
>
> init:
>
> clover.setup:
>
> clover.info:
>
> clover:
>
> compile-core:
>
> compile:
>
> compile-highlighter:
>     [echo] Building highlighter...
>
> build-memory:
>
> build-regex:
>
> javacc-uptodate-check:
>
> javacc-notice:
>
> jflex-uptodate-check:
>
> jflex-notice:
>
> common.init:
>
> build-lucene:
>
> build-lucene-tests:
>
> init:
>
> clover.setup:
>
> clover.info:
>
> clover:
>
> common.compile-core:
>
> compile-core:
>
> compile:
>
> compile-vector-highlighter:
>     [echo] Building fast-vector-highlighter...
>
> build-analyzers:
>     [echo] Fast Vector Highlighter building dependency 
> 
>
> common:
>     [echo] Building analyzers...
>
> javacc-uptodate-check:
>
> javacc-notice:
>
> jflex-uptodate-check:
>
> jflex-notice:
>
> common.init:
>
> build-lucene:
>
> build-lucene-tests:
>
> init:
>
> clover.setup:
>
> clover.info:
>
> clover:
>
> compile-core:
>
> jar-core:
>      [jar] Building jar: 
> 
>
> default:
>
> smartcn:
>     [echo] Building smartcn...
>
> javacc-uptodate-check:
>
> javacc-notice:
>
> jflex-uptodate-check:
>
> jflex-notice:
>
> common.init:
>
> build-lucene:
>
> build-lucene-tests:
>
> init:
>
> clover.setup:
>
> clover.info:
>
> clover:
>
> compile-core:
>
> jar-core:
>      [jar] Building jar: 
> 
>
> default:
>
> default:
>
> javacc-uptodate-check:
>
> javacc-notice:
>
> jflex-uptodate-check:
>
> jflex-notice:
>
> common.init:
>
> build-lucene:
>
> build-lucene-tests:
>
> init:
>
> clover.setup:
>
> clover.info:
>
> clover:
>
> common.compile-core:
>
> compile-core:
>
> compile:
>
> check-files:
>
> init:
>
> clover.setup:
>
> clover.info:
>
> clover:
>
> compile-core:
>
> common.compile-test:
>    [mkdir] Created dir: 
> 
>    [javac] Compiling 12 source files to 
> 
>    [javac] Note: 
> 
>  uses or overrides a deprecated API.
>    [javac] Note: Recompile with -Xlint:deprecation for details.
>     [copy] Copying 3 files to 
> 
>
> build-artifacts-and-tests:
>
> bdb:
>     [echo] Building bdb...
>
> javacc-uptodate-check:
>
> javacc-notice:
>
> jflex-uptodate-check:
>
> jflex-notice:
>
> common.init:
>
> build-lucene:
>
> build-lucene-tests:
>
> contrib-build.init:
>
> get-db-jar:
>    [mkdir] Created dir: 
> 
>      [get] Getting: http://downloads.osafoundation.org/db/db-4.7.25.jar
>      [get] To: 
> 
>
> check-and-get-db-jar:
>
> init:
>
> clover.setup:
>
> clover.info:
>
> clover:
>
> compile-core:
>    [mkdir] Created dir: 
> 
>    [javac] Compiling 7 source files to 
> 
>
> jar-core:
>      [jar] Building jar: 
> 
>
> default

[jira] Updated: (LUCENE-2209) add @experimental javadocs tag

2010-01-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2209:
---

Attachment: LUCENE-2209-internal.patch

Patch to add @lucene.internal tags to some classes (mostly in oal.util.*).

> add @experimental javadocs tag
> --
>
> Key: LUCENE-2209
> URL: https://issues.apache.org/jira/browse/LUCENE-2209
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Javadocs
>Reporter: Robert Muir
> Attachments: LUCENE-2209-internal.patch, LUCENE-2209.patch, 
> LUCENE-2209.patch
>
>
> There are a lot of things marked experimental, api subject to change, etc. in 
> lucene.
> this patch simply adds a @experimental tag to common-build.xml so that we can 
> use it, for more consistency.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2209) add @experimental javadocs tag

2010-01-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801149#action_12801149
 ] 

Uwe Schindler commented on LUCENE-2209:
---

Maybe we should make these messages font color="red" like before?

As for the util classes, some of them are not really internal because they are 
also used in custom query implementations. ToStringUtils, for example, is used 
in the toString() methods of all my own Query classes.

> add @experimental javadocs tag
> --
>
> Key: LUCENE-2209
> URL: https://issues.apache.org/jira/browse/LUCENE-2209
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Javadocs
>Reporter: Robert Muir
> Attachments: LUCENE-2209-internal.patch, LUCENE-2209.patch, 
> LUCENE-2209.patch
>
>
> There are a lot of things marked experimental, api subject to change, etc. in 
> lucene.
> this patch simply adds a @experimental tag to common-build.xml so that we can 
> use it, for more consistency.




[jira] Commented: (LUCENE-2209) add @experimental javadocs tag

2010-01-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801150#action_12801150
 ] 

Robert Muir commented on LUCENE-2209:
-

bq. Maybe we should make these messages font color="red" like before? 

I like just the bold myself, I think it stands out enough.


> add @experimental javadocs tag
> --
>
> Key: LUCENE-2209
> URL: https://issues.apache.org/jira/browse/LUCENE-2209
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Javadocs
>Reporter: Robert Muir
> Attachments: LUCENE-2209-internal.patch, LUCENE-2209.patch, 
> LUCENE-2209.patch
>
>
> There are a lot of things marked experimental, api subject to change, etc. in 
> lucene.
> this patch simply adds a @experimental tag to common-build.xml so that we can 
> use it, for more consistency.




[jira] Created: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Dawid Weiss (JIRA)
OpenBitSet#hashCode() may return false for identical sets.
--

 Key: LUCENE-2216
 URL: https://issues.apache.org/jira/browse/LUCENE-2216
 Project: Lucene - Java
  Issue Type: Bug
  Components: Other
Affects Versions: 3.0, 2.9.1, 2.9
Reporter: Dawid Weiss
Priority: Minor


OpenBitSet uses an internal buffer of long variables to store set bits and an 
additional 'wlen' index that points to the highest used component inside the 
{@link #bits} buffer.

Unlike in the JDK's java.util.BitSet, the wlen field is not continuously 
maintained (on clearing bits, for example). As a result, wlen may point far 
beyond the last set bit.

The hashCode implementation iterates over all long components of the bits 
buffer, rotating the hash even for empty components. Two sets that are equal 
but have different buffer lengths can therefore produce different hash codes, 
which violates the hashCode-equals contract. The following test case 
illustrates this:

{code}
// initialize two bitsets with different capacity (bits length).
OpenBitSet bs1 = new OpenBitSet(200);
OpenBitSet bs2 = new OpenBitSet(64);
// set the same bit.
bs1.set(3);
bs2.set(3);

// equals returns true (passes).
assertEquals(bs1, bs2);
// the hash codes differ, so this assertion fails (against contract).
assertEquals(bs1.hashCode(), bs2.hashCode());
{code}

Fix and test case attached.
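
The effect described above can be reproduced without Lucene. The following is a minimal sketch and not the actual OpenBitSet code: the class name RotatingHashDemo, the seed value 0x1234 and the exact mixing step are assumptions chosen for illustration. It rotates the hash for every word of the buffer, including empty trailing words, so two buffers that hold the same set bits but differ in length end up with different hashes.

```java
final class RotatingHashDemo {
    // Hypothetical hash in the style described above: every word of the
    // buffer rotates the hash, even when that word is zero.
    static int hashAllWords(long[] bits) {
        long h = 0x1234L; // assumed non-zero seed, for illustration only
        for (int i = bits.length; --i >= 0;) {
            h ^= bits[i];
            h = (h << 1) | (h >>> 63); // rotate left by one bit
        }
        // fold the 64-bit value into 32 bits
        return (int) ((h >> 32) ^ h);
    }
}
```

With this sketch, hashAllWords(new long[] {8L}) and hashAllWords(new long[] {8L, 0L, 0L, 0L}) differ, although both arrays represent a set containing only bit 3.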




[jira] Updated: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Dawid Weiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated LUCENE-2216:


Attachment: openbitset.patch

> OpenBitSet#hashCode() may return false for identical sets.
> --
>
> Key: LUCENE-2216
> URL: https://issues.apache.org/jira/browse/LUCENE-2216
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Other
>Affects Versions: 2.9, 2.9.1, 3.0
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: openbitset.patch
>
>
> OpenBitSet uses an internal buffer of long variables to store set bits and an 
> additional 'wlen' index that points 
> to the highest used component inside {@link #bits} buffer.
> Unlike in JDK, the wlen field is not continuously maintained (on clearing 
> bits, for example). This leads to a situation when wlen may point
> far beyond the last set bit. 
> The hashCode implementation iterates over all long components of the bits 
> buffer, rotating the hash even for empty components. This is against the 
> contract of hashCode-equals. The following test case illustrates this:
> {code}
> // initialize two bitsets with different capacity (bits length).
> BitSet bs1 = new BitSet(200);
> BitSet bs2 = new BitSet(64);
> // set the same bit.
> bs1.set(3);
> bs2.set(3);
> 
> // equals returns true (passes).
> assertEquals(bs1, bs2);
> // hashCode returns false (against contract).
> assertEquals(bs1.hashCode(), bs2.hashCode());
> {code}
> Fix and test case attached.




[jira] Commented: (LUCENE-2213) Small improvements to ArrayUtil.getNextSize

2010-01-16 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801160#action_12801160
 ] 

Dawid Weiss commented on LUCENE-2213:
-

Not to be picky, Michael, but is long promotion required here? Would it be 
easier to check whether you overflow into negative integers and, if so, set 
the result to MAX_VALUE?

Another thing -- given that the parameter is an int, you'll never be able to 
grow beyond Integer.MAX_VALUE (because no larger int value exists). I'd change 
the contract to reflect this fact -- if the signature takes two parameters (int 
currentSize, int expectedAdditions) then it's easy to throw an unchecked 
exception if you simply can't meet the contract:

if (currentSize + expectedAdditions < 0)
  throw new RuntimeException("Cannot allocate array larger than: " + Integer.MAX_VALUE);

When reallocating, you can call it with grow(currentSize, 1), just to make sure 
the array will be at least one element larger than previously; the method can 
then make its best effort in estimating the growth ratio, but cap the result 
at Integer.MAX_VALUE before overflowing into negative integers (and avoid 
looping endlessly when Integer.MAX_VALUE is passed as an input argument).
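
The contract suggested above could look roughly like the sketch below. It is an illustration only, not the ArrayUtil code or the attached patch: the class name ArrayGrowth is hypothetical, the growth policy (1/8th, with a minimum of 3 extra slots) is loosely borrowed from the issue description, and the rejection of negative arguments is an added assumption.

```java
final class ArrayGrowth {
    /**
     * Returns a new array size of at least currentSize + expectedAdditions.
     * Throws if that minimum cannot be represented as an int.
     */
    static int grow(int currentSize, int expectedAdditions) {
        if (currentSize < 0 || expectedAdditions < 0) {
            // assumption: negative inputs are treated as caller errors
            throw new IllegalArgumentException("sizes must be non-negative");
        }
        int minSize = currentSize + expectedAdditions;
        if (minSize < 0) {
            // the sum of two non-negative ints overflowed
            throw new RuntimeException(
                "Cannot allocate array larger than: " + Integer.MAX_VALUE);
        }
        // best-effort over-allocation: grow by 1/8th, but by at least 3
        int extra = Math.max(3, minSize >> 3);
        int newSize = minSize + extra;
        if (newSize < 0) {
            // over-allocation overflowed, so cap instead of failing
            newSize = Integer.MAX_VALUE;
        }
        return newSize;
    }
}
```

Under these assumptions grow(0, 1) returns 4 and grow(16, 1) returns 20, while grow(Integer.MAX_VALUE, 1) throws because the minimum size itself cannot be met.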

These are just thoughts of course -- I've just finished implementing something 
like this for another project...

> Small improvements to ArrayUtil.getNextSize
> ---
>
> Key: LUCENE-2213
> URL: https://issues.apache.org/jira/browse/LUCENE-2213
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2213.patch, LUCENE-2213.patch
>
>
> Spinoff from java-dev thread "Dynamic array reallocation algorithms" started 
> on Jan 12, 2010.
> Here's what I did:
>   * Keep the +3 for small sizes
>   * Added 2nd arg = number of bytes per element.
>   * Round up to 4 or 8 byte boundary (if it's 32 or 64 bit JRE respectively)
>   * Still grow by 1/8th
>   * If 0 is passed in, return 0 back
> I also had to remove some asserts in tests that were checking the actual 
> values returned by this method -- I don't think we should test that (it's an 
> impl. detail).




[jira] Commented: (LUCENE-2206) integrate snowball stopword lists

2010-01-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801163#action_12801163
 ] 

Simon Willnauer commented on LUCENE-2206:
-

Robert, the patch looks good except for one thing. 
{code}
  public static HashSet getSnowballWordSet(Reader reader)
{code}

It returns a HashSet but should really return a Set. We plan to change 
all return types to the interface instead of the implementation.
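
A sketch of what the interface-typed signature might look like is shown below. This is not the code from the patch: the class name SnowballWordListSketch is hypothetical, and the parsing rules (whitespace-separated words, with '|' starting a comment that runs to the end of the line) are an assumption about the Snowball file format based on the issue description.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

final class SnowballWordListSketch {
    // Declared to return the Set interface; HashSet stays an internal detail.
    static Set<String> getSnowballWordSet(Reader reader) throws IOException {
        Set<String> result = new HashSet<String>();
        BufferedReader br = new BufferedReader(reader);
        String line;
        while ((line = br.readLine()) != null) {
            // assumption: '|' begins a comment that runs to the end of the line
            int comment = line.indexOf('|');
            if (comment >= 0) {
                line = line.substring(0, comment);
            }
            // a line may hold several whitespace-separated words
            for (String word : line.trim().split("\\s+")) {
                if (word.length() > 0) {
                    result.add(word);
                }
            }
        }
        return result;
    }
}
```

Callers then depend only on Set, so the implementation can later switch to another set type without changing the public signature.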


> integrate snowball stopword lists
> -
>
> Key: LUCENE-2206
> URL: https://issues.apache.org/jira/browse/LUCENE-2206
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Assignee: Robert Muir
> Fix For: 3.1
>
> Attachments: LUCENE-2206.patch
>
>
> The snowball project creates stopword lists as well as stemmers, example: 
> http://svn.tartarus.org/snowball/trunk/website/algorithms/english/stop.txt?view=markup
> This patch includes the following:
> * snowball stopword lists for 13 languages in contrib/snowball/resources
> * all stoplists are unmodified, only added license header and converted each 
> one from whatever encoding it was in to UTF-8
> * added getSnowballWordSet  to WordListLoader, this is because the format of 
> these files is very different, for example it supports multiple words per 
> line and embedded comments.
> I did not add any changes to SnowballAnalyzer to actually automatically use 
> these lists yet, i would like us to discuss this in a future issue proposing 
> integrating snowball with contrib/analyzers.




[jira] Commented: (LUCENE-2206) integrate snowball stopword lists

2010-01-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801166#action_12801166
 ] 

Robert Muir commented on LUCENE-2206:
-

Thanks Simon, I agree.

> integrate snowball stopword lists
> -
>
> Key: LUCENE-2206
> URL: https://issues.apache.org/jira/browse/LUCENE-2206
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Assignee: Robert Muir
> Fix For: 3.1
>
> Attachments: LUCENE-2206.patch
>
>
> The snowball project creates stopword lists as well as stemmers, example: 
> http://svn.tartarus.org/snowball/trunk/website/algorithms/english/stop.txt?view=markup
> This patch includes the following:
> * snowball stopword lists for 13 languages in contrib/snowball/resources
> * all stoplists are unmodified, only added license header and converted each 
> one from whatever encoding it was in to UTF-8
> * added getSnowballWordSet  to WordListLoader, this is because the format of 
> these files is very different, for example it supports multiple words per 
> line and embedded comments.
> I did not add any changes to SnowballAnalyzer to actually automatically use 
> these lists yet, i would like us to discuss this in a future issue proposing 
> integrating snowball with contrib/analyzers.




[jira] Created: (LUCENE-2217) SortedVIntList allocation should use ArrayUtils.getNextSize()

2010-01-16 Thread Paul Elschot (JIRA)
SortedVIntList allocation should use ArrayUtils.getNextSize()
-

 Key: LUCENE-2217
 URL: https://issues.apache.org/jira/browse/LUCENE-2217
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Other
Reporter: Paul Elschot


See recent discussion on ArrayUtils.getNextSize().




[jira] Updated: (LUCENE-2217) SortedVIntList allocation should use ArrayUtils.getNextSize()

2010-01-16 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-2217:
-

Attachment: LUCENE-2217.patch

> SortedVIntList allocation should use ArrayUtils.getNextSize()
> -
>
> Key: LUCENE-2217
> URL: https://issues.apache.org/jira/browse/LUCENE-2217
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Reporter: Paul Elschot
> Attachments: LUCENE-2217.patch
>
>
> See recent discussion on ArrayUtils.getNextSize().




[jira] Updated: (LUCENE-2217) SortedVIntList allocation should use ArrayUtils.getNextSize()

2010-01-16 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-2217:
-

Priority: Trivial  (was: Major)

> SortedVIntList allocation should use ArrayUtils.getNextSize()
> -
>
> Key: LUCENE-2217
> URL: https://issues.apache.org/jira/browse/LUCENE-2217
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Reporter: Paul Elschot
>Priority: Trivial
> Attachments: LUCENE-2217.patch
>
>
> See recent discussion on ArrayUtils.getNextSize().




[jira] Resolved: (LUCENE-2206) integrate snowball stopword lists

2010-01-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-2206.
-

Resolution: Fixed

Committed revision 899955.

> integrate snowball stopword lists
> -
>
> Key: LUCENE-2206
> URL: https://issues.apache.org/jira/browse/LUCENE-2206
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Assignee: Robert Muir
> Fix For: 3.1
>
> Attachments: LUCENE-2206.patch
>
>
> The snowball project creates stopword lists as well as stemmers, example: 
> http://svn.tartarus.org/snowball/trunk/website/algorithms/english/stop.txt?view=markup
> This patch includes the following:
> * snowball stopword lists for 13 languages in contrib/snowball/resources
> * all stoplists are unmodified, only added license header and converted each 
> one from whatever encoding it was in to UTF-8
> * added getSnowballWordSet  to WordListLoader, this is because the format of 
> these files is very different, for example it supports multiple words per 
> line and embedded comments.
> I did not add any changes to SnowballAnalyzer to actually automatically use 
> these lists yet, i would like us to discuss this in a future issue proposing 
> integrating snowball with contrib/analyzers.




[jira] Commented: (LUCENE-2212) add a test for PorterStemFilter

2010-01-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801181#action_12801181
 ] 

Simon Willnauer commented on LUCENE-2212:
-

Nice, Robert. I was adding a test class for PorterStemFilter during LUCENE-2198 
to test the KeywordAttr, but this looks very good.
I wonder if we should use getResourceAsStream rather than the system property; 
the resources should always be on the classpath.
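
One way to read that suggestion is sketched below. The helper class TestResources and its open method are hypothetical names, not part of the patch; the only real API used is Class.getResourceAsStream, which returns null when the resource is not found on the classpath.

```java
import java.io.InputStream;

final class TestResources {
    /**
     * Opens a test resource relative to the given class, failing with a
     * clear message instead of a later NullPointerException.
     */
    static InputStream open(Class<?> owner, String name) {
        InputStream in = owner.getResourceAsStream(name);
        if (in == null) {
            throw new IllegalStateException(
                "Test resource not on classpath: " + name);
        }
        return in;
    }
}
```

A test would then call something like TestResources.open(getClass(), "porterTestData.zip"), assuming the zip file sits next to the test class on the classpath, rather than building a file path from a system property.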



> add a test for PorterStemFilter
> ---
>
> Key: LUCENE-2212
> URL: https://issues.apache.org/jira/browse/LUCENE-2212
> Project: Lucene - Java
>  Issue Type: Test
>  Components: Analysis
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2212.patch, porterTestData.zip
>
>
> There are no tests for PorterStemFilter, yet svn history reveals some (very 
> minor) cleanups, etc.
> The only thing executing its code in tests is a test or two in SmartChinese 
> tests.
> This patch runs the StemFilter against Martin Porter's test data set for this 
> stemmer, checking for expected output.
> The zip file is 100KB added to src/test, if this is too large I can change it 
> to download the data instead.




[jira] Commented: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801182#action_12801182
 ] 

Yonik Seeley commented on LUCENE-2216:
--

Thanks Dawid!

hashCode and equals probably shouldn't be modifying the state of the object 
though, right?
It's also not thread safe, so a lot of weird things could happen... the 
simplest example is that two threads could check that the last word is all 
zeros and both decrement wlen.

I like the spirit of your change though, as it only adds to the cost of 
hashCode/equals (which are already very expensive with large bitsets and should 
be avoided if possible anyway).

> OpenBitSet#hashCode() may return false for identical sets.
> --
>
> Key: LUCENE-2216
> URL: https://issues.apache.org/jira/browse/LUCENE-2216
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Other
>Affects Versions: 2.9, 2.9.1, 3.0
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: openbitset.patch
>
>
> OpenBitSet uses an internal buffer of long variables to store set bits and an 
> additional 'wlen' index that points 
> to the highest used component inside {@link #bits} buffer.
> Unlike in JDK, the wlen field is not continuously maintained (on clearing 
> bits, for example). This leads to a situation when wlen may point
> far beyond the last set bit. 
> The hashCode implementation iterates over all long components of the bits 
> buffer, rotating the hash even for empty components. This is against the 
> contract of hashCode-equals. The following test case illustrates this:
> {code}
> // initialize two bitsets with different capacity (bits length).
> BitSet bs1 = new BitSet(200);
> BitSet bs2 = new BitSet(64);
> // set the same bit.
> bs1.set(3);
> bs2.set(3);
> 
> // equals returns true (passes).
> assertEquals(bs1, bs2);
> // hashCode returns false (against contract).
> assertEquals(bs1.hashCode(), bs2.hashCode());
> {code}
> Fix and test case attached.




[jira] Updated: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated LUCENE-2216:
-

Attachment: LUCENE-2216.patch

I haven't tested this patch, but this seems like a simple solution.  Start with 
a zero hashcode while iterating backward and the trailing zeros won't affect 
the hashcode.
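
A minimal sketch of that idea follows. It is not the attached LUCENE-2216.patch: the class name TrailingZeroSafeHash, the method signature and the mixing step are assumptions for illustration. Because the hash starts at zero and the loop runs from the highest word down, XOR-ing and rotating leaves it at zero until the first non-zero word is reached, so empty trailing words contribute nothing.

```java
final class TrailingZeroSafeHash {
    /**
     * Hashes the first wlen words of bits. Zero words at the high end of
     * the buffer do not change the result.
     */
    static int hash(long[] bits, int wlen) {
        long h = 0; // zero start: trailing zero words keep h at zero
        for (int i = wlen; --i >= 0;) {
            h ^= bits[i];
            h = (h << 1) | (h >>> 63); // rotate left by one bit
        }
        // fold the 64-bit value into 32 bits
        return (int) ((h >> 32) ^ h);
    }
}
```

With this sketch, hash(new long[] {8L}, 1) and hash(new long[] {8L, 0L, 0L, 0L}, 4) are equal, and neither call modifies any state, which also addresses the concern about hashCode mutating the object.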

> OpenBitSet#hashCode() may return false for identical sets.
> --
>
> Key: LUCENE-2216
> URL: https://issues.apache.org/jira/browse/LUCENE-2216
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Other
>Affects Versions: 2.9, 2.9.1, 3.0
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: LUCENE-2216.patch, openbitset.patch
>
>
> OpenBitSet uses an internal buffer of long variables to store set bits and an 
> additional 'wlen' index that points 
> to the highest used component inside {@link #bits} buffer.
> Unlike in JDK, the wlen field is not continuously maintained (on clearing 
> bits, for example). This leads to a situation when wlen may point
> far beyond the last set bit. 
> The hashCode implementation iterates over all long components of the bits 
> buffer, rotating the hash even for empty components. This is against the 
> contract of hashCode-equals. The following test case illustrates this:
> {code}
> // initialize two bitsets with different capacity (bits length).
> BitSet bs1 = new BitSet(200);
> BitSet bs2 = new BitSet(64);
> // set the same bit.
> bs1.set(3);
> bs2.set(3);
> 
> // equals returns true (passes).
> assertEquals(bs1, bs2);
> // hashCode returns false (against contract).
> assertEquals(bs1.hashCode(), bs2.hashCode());
> {code}
> Fix and test case attached.




[jira] Commented: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801195#action_12801195
 ] 

Dawid Weiss commented on LUCENE-2216:
-

Hi Yonik,

This class is not thread-safe anyway (there are no memory barriers of any kind 
anywhere in the code).

From a single-thread perspective, yes, you are modifying the internal state of 
this object, but it's not really affecting anything other than possibly 
speeding up further interaction with this object (any other operation on 
OpenBitSets is affected by the value inside wlen).

Your patch also solves the issue, of course. I just don't see the point in 
_not_ updating wlen since you're scanning through memory anyway... The 
implementation of OpenBitSet is different in this regard to java.util.BitSet, 
which always maintains the last non-empty index. I've been thinking about it a 
bit and there are pros and cons to both implementations, but lazily moving wlen 
when memory is scanned anyway seems like a better alternative than keeping wlen 
unnecessarily large (which affects ORs, ANDs and other set operations).
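
The trimming step discussed here can be sketched as a pure function. This is an illustration, not the code from openbitset.patch: the class name WordLength and the method effectiveLength are hypothetical. It computes how many words are actually in use; whether the caller writes the result back into wlen, as proposed above, or only uses it locally is exactly the design question under discussion.

```java
final class WordLength {
    /**
     * Returns the number of words up to and including the last non-zero
     * word, scanning down from wlen.
     */
    static int effectiveLength(long[] bits, int wlen) {
        int n = Math.min(wlen, bits.length);
        while (n > 0 && bits[n - 1] == 0L) {
            n--; // skip empty words at the high end
        }
        return n;
    }
}
```

For a buffer {8L, 0L, 0L, 0L} with wlen 4 this returns 1, so later set operations would only need to visit one word instead of four.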

To me this implementation cannot be used in a multi-threaded application 
anyway, am I wrong here?

D.

> OpenBitSet#hashCode() may return false for identical sets.
> --
>
> Key: LUCENE-2216
> URL: https://issues.apache.org/jira/browse/LUCENE-2216
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Other
>Affects Versions: 2.9, 2.9.1, 3.0
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: LUCENE-2216.patch, openbitset.patch
>
>
> OpenBitSet uses an internal buffer of long variables to store set bits and an 
> additional 'wlen' index that points 
> to the highest used component inside {@link #bits} buffer.
> Unlike in JDK, the wlen field is not continuously maintained (on clearing 
> bits, for example). This leads to a situation when wlen may point
> far beyond the last set bit. 
> The hashCode implementation iterates over all long components of the bits 
> buffer, rotating the hash even for empty components. This is against the 
> contract of hashCode-equals. The following test case illustrates this:
> {code}
> // initialize two bitsets with different capacity (bits length).
> BitSet bs1 = new BitSet(200);
> BitSet bs2 = new BitSet(64);
> // set the same bit.
> bs1.set(3);
> bs2.set(3);
> 
> // equals returns true (passes).
> assertEquals(bs1, bs2);
> // hashCode returns false (against contract).
> assertEquals(bs1.hashCode(), bs2.hashCode());
> {code}
> Fix and test case attached.




[jira] Commented: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801198#action_12801198
 ] 

Dawid Weiss commented on LUCENE-2216:
-

Perhaps this is for another patch, but BitUtil contains several bit-counting 
methods (pop, ntz) that have been implemented in the JDK in the same way 
(Hacker's Delight) and will come with HotSpot intrinsics for the new Intels 
(http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6823354). On the other 
hand, Lucene's implementation may be useful for folks with older VMs...

> OpenBitSet#hashCode() may return false for identical sets.
> --
>
> Key: LUCENE-2216
> URL: https://issues.apache.org/jira/browse/LUCENE-2216
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Other
>Affects Versions: 2.9, 2.9.1, 3.0
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: LUCENE-2216.patch, openbitset.patch
>
>
> OpenBitSet uses an internal buffer of long variables to store set bits and an 
> additional 'wlen' index that points 
> to the highest used component inside {@link #bits} buffer.
> Unlike in JDK, the wlen field is not continuously maintained (on clearing 
> bits, for example). This leads to a situation when wlen may point
> far beyond the last set bit. 
> The hashCode implementation iterates over all long components of the bits 
> buffer, rotating the hash even for empty components. This is against the 
> contract of hashCode-equals. The following test case illustrates this:
> {code}
> // initialize two bitsets with different capacity (bits length).
> BitSet bs1 = new BitSet(200);
> BitSet bs2 = new BitSet(64);
> // set the same bit.
> bs1.set(3);
> bs2.set(3);
> 
> // equals returns true (passes).
> assertEquals(bs1, bs2);
> // hashCode returns false (against contract).
> assertEquals(bs1.hashCode(), bs2.hashCode());
> {code}
> Fix and test case attached.




[jira] Commented: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801200#action_12801200
 ] 

Yonik Seeley commented on LUCENE-2216:
--

bq. To me this implementation cannot be used in a multi-threaded application 
anyway, am I wrong here?

Pretty much any mutable object may be safely shared with other threads after 
it's done being modified.  So one thread could create, and many threads could 
read.  I don't know how explicitly it's spelled out in Java, but hashCode and  
equals shouldn't modify the object's state in any meaningful way.



[jira] Commented: (LUCENE-2212) add a test for PorterStemFilter

2010-01-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801212#action_12801212
 ] 

Robert Muir commented on LUCENE-2212:
-

bq. I wonder if we should use GetResourcesAsStream rather than the system 
property. the resources should always be on the classpath. 

I don't think we should, because otherwise the test becomes complicated.
This test reads through voc.txt sequentially and, for each line in the file, 
verifies the expected output against the same line in output.txt.
ZipFile does the buffering and such transparently, which makes this very simple.

With ZipInputStream I would have to do this myself.
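
To illustrate the point about ZipFile, here is a rough sketch of reading two zip entries in lock step. It is not the code from the patch: the class and method names are made up, the sample zip written by `writeSampleZip` only stands in for porterTestData.zip, and the actual stemmer call is left as a comment.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

final class ZipPairSketch {
    /** Writes a tiny zip with two parallel entries (stand-in for the real test data). */
    static File writeSampleZip() {
        try {
            File zip = File.createTempFile("pairs", ".zip");
            zip.deleteOnExit();
            try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip.toPath()))) {
                putEntry(out, "voc.txt", "caresses\nponies\n");
                putEntry(out, "output.txt", "caress\nponi\n");
            }
            return zip;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void putEntry(ZipOutputStream out, String name, String text) throws IOException {
        out.putNextEntry(new ZipEntry(name));
        out.write(text.getBytes(StandardCharsets.UTF_8));
        out.closeEntry();
    }

    /** Reads both entries line by line, in parallel, and returns the number of pairs seen. */
    static int countPairs(File zipPath) {
        try (ZipFile zip = new ZipFile(zipPath);
             BufferedReader voc = open(zip, "voc.txt");
             BufferedReader expected = open(zip, "output.txt")) {
            int pairs = 0;
            String word;
            while ((word = voc.readLine()) != null) {
                String stem = expected.readLine();
                if (stem == null) {
                    throw new IllegalStateException("output.txt is shorter than voc.txt");
                }
                // A real test would run the stemmer on 'word' here and compare with 'stem'.
                pairs++;
            }
            return pairs;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static BufferedReader open(ZipFile zip, String entry) throws IOException {
        // ZipFile allows random access, so two entries can be open at the same time.
        return new BufferedReader(
            new InputStreamReader(zip.getInputStream(zip.getEntry(entry)), StandardCharsets.UTF_8));
    }
}
```

A ZipInputStream only walks the entries one after another, which is why reading two entries side by side would need manual buffering of one of them.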


> add a test for PorterStemFilter
> ---
>
> Key: LUCENE-2212
> URL: https://issues.apache.org/jira/browse/LUCENE-2212
> Project: Lucene - Java
>  Issue Type: Test
>  Components: Analysis
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2212.patch, porterTestData.zip
>
>
> There are no tests for PorterStemFilter, yet svn history reveals some (very 
> minor) cleanups, etc.
> The only thing executing its code in tests is a test or two in SmartChinese 
> tests.
> This patch runs the StemFilter against Martin Porter's test data set for this 
> stemmer, checking for expected output.
> The zip file is 100KB added to src/test, if this is too large I can change it 
> to download the data instead.



[jira] Updated: (LUCENE-2212) add a test for PorterStemFilter

2010-01-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2212:


Attachment: LUCENE-2212.patch

updated patch with getResource() + ZipFile

will commit this test at the end of the day unless anyone objects.



[jira] Assigned: (LUCENE-2195) Speedup CharArraySet if set is empty

2010-01-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir reassigned LUCENE-2195:
---

Assignee: Robert Muir

> Speedup CharArraySet if set is empty
> 
>
> Key: LUCENE-2195
> URL: https://issues.apache.org/jira/browse/LUCENE-2195
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Simon Willnauer
>Assignee: Robert Muir
> Fix For: 3.1
>
> Attachments: LUCENE-2195.patch, LUCENE-2195.patch, LUCENE-2195.patch
>
>
> CharArraySet#contains(...) always creates a HashCode of the String, Char[] or 
> CharSequence even if the set is empty. 
> contains should return false if the set is empty



[jira] Commented: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801221#action_12801221
 ] 

Dawid Weiss commented on LUCENE-2216:
-

This is only true if there is a happens-before relationship between the reads 
and the modifications to the object. In any other case other threads may be 
reading stale values (i.e., from their own cache), at least if my understanding 
of the JMM is correct here. Whether you want to rely on such deep semantics of 
interaction between threads is something to consider carefully, at least in my 
personal opinion.
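
As a small illustration of the happens-before requirement, the sketch below mutates a plain object in one thread and only then starts the reader. All names are hypothetical. It relies on two edges the Java memory model guarantees: `Thread.start()` happens-before everything in the started thread, and everything in a thread happens-before `join()` returning.

```java
final class SafePublicationSketch {
    /** Plain mutable holder with no synchronization of its own. */
    static final class Counter {
        int value;
    }

    /** Mutates the holder, then hands it to a reader thread and returns what the reader saw. */
    static int readFromOtherThread() {
        final Counter counter = new Counter();
        counter.value = 42;                    // all mutation is finished before start()
        final int[] seen = new int[1];
        Thread reader = new Thread(() -> seen[0] = counter.value);
        reader.start();                        // start() happens-before the body of the reader
        try {
            reader.join();                     // the reader's writes happen-before join() returning
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```

Without such an edge (for example, handing the object over through a plain static field while the reader is already running), the reader is allowed to observe a stale value.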



[jira] Updated: (LUCENE-2195) Speedup CharArraySet if set is empty

2010-01-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2195:


Attachment: LUCENE-2195.patch

Simon, I made a few changes (I like the idea of simply having a special 
EmptyCharArraySet, so this does not slow down anything else and alleviates any 
concerns):

* I do not think unmodifiableset should have a no-arg ctor, so instead I pushed 
this up to emptychararrayset.
* I do not think emptychararrayset needs to override removeAll or retainAll to 
throw UOE, and I don't think the tests were correct in assuming it will throw 
UOE. It will not throw UOE for, say, removeAll, simply because it is empty; it 
will just do nothing.

Now the emptychararrayset conforms with AbstractCollection.removeAll/retainAll:
Note that this implementation will throw an UnsupportedOperationException if 
the iterator returned by the iterator method does not implement the remove 
method and this collection contains one or more elements in common with/not 
present with the specified collection.
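
The behaviour described above can be checked with a small stand-alone sketch. `EmptySetSketch` is a hypothetical class built on `java.util.AbstractSet`, not the EmptyCharArraySet from the patch; it only shows that the inherited removeAll/retainAll never reach `Iterator.remove()` when the set is empty, and that contains can answer without hashing anything.

```java
import java.util.AbstractSet;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

final class EmptySetSketch extends AbstractSet<String> {
    @Override
    public Iterator<String> iterator() {
        return Collections.emptyIterator();      // hasNext() is always false
    }

    @Override
    public int size() {
        return 0;
    }

    @Override
    public boolean contains(Object o) {
        return false;                            // no hash code is computed for an empty set
    }

    @Override
    public boolean add(String s) {
        throw new UnsupportedOperationException();
    }

    public static void main(String[] args) {
        EmptySetSketch empty = new EmptySetSketch();
        // Neither call throws: there is no element for the iterator to remove.
        System.out.println(empty.removeAll(Arrays.asList("a", "b")));
        System.out.println(empty.retainAll(Arrays.asList("a", "b")));
    }
}
```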




[jira] Commented: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801230#action_12801230
 ] 

Dawid Weiss commented on LUCENE-2216:
-

This is not entirely what I had in mind (it's not the cache, but a HotSpot 
optimisation), but a similar situation applies (the value of a field that's 
never modified from the perspective of the current thread is never re-read).

{code}
public class Example10 {
    private static Holder holder;

    public static void startThread() {
        new Thread() {
            public void run() {
                try { sleep(2000); } catch (Exception e) { /* ignore */ }
                holder.ready = true;
                System.out.println("Setting ready to true.");
            }
        }.start();
    }

    public static void main(String [] args) {
        holder = new Holder();
        startThread();
        // 'ready' is a plain field, so the JIT may hoist this read out of the loop.
        while (!holder.ready) {
            // Do nothing.
        }
        System.out.println("I'm ready.");
    }
}

class Holder {
    public boolean ready;
}
{code}

If you run it with -server, it will (should... or does on two machines I own) 
hang: the loop never exits. Client mode and interpreted mode do not apply this 
optimisation, so there the program terminates.
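
For contrast, a sketch of the same spinning loop with the flag declared volatile, which forces the read to be repeated on every iteration. The class and method names are made up for illustration, and the timeout is only there so the example cannot spin forever.

```java
final class VolatileFlagSketch {
    static final class Holder {
        volatile boolean ready;   // volatile: the read cannot be hoisted out of the loop
    }

    /** Spins until another thread sets the flag; returns false if timeoutMillis elapses first. */
    static boolean waitForFlag(long delayMillis, long timeoutMillis) {
        final Holder holder = new Holder();
        Thread setter = new Thread(() -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                return;
            }
            holder.ready = true;
        });
        setter.setDaemon(true);
        setter.start();

        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!holder.ready) {
            if (System.nanoTime() - deadline > 0) {
                return false;     // gave up waiting
            }
        }
        return true;
    }
}
```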



[jira] Issue Comment Edited: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801230#action_12801230
 ] 

Dawid Weiss edited comment on LUCENE-2216 at 1/16/10 5:26 PM:
--

This is not entirely what I had in mind (it's not the cache, but a HotSpot 
optimisation), but a similar situation applies (the value of a field that's 
never modified from the perspective of the current thread is never re-read).

{code}
public class Example10 {
    private static Holder holder;

    public static void startThread() {
        new Thread() {
            public void run() {
                try { sleep(2000); } catch (Exception e) { /* ignore */ }
                holder.ready = true;
                System.out.println("Setting ready to true.");
            }
        }.start();
    }

    public static void main(String [] args) {
        holder = new Holder();
        startThread();
        // 'ready' is a plain field, so the JIT may hoist this read out of the loop.
        while (!holder.ready) {
            // Do nothing.
        }
        System.out.println("I'm ready.");
    }
}

class Holder {
    public boolean ready;
}
{code}

If you run it with -server, it will (should... or does on two machines I own) 
hang: the loop never exits. Client mode and interpreted mode do not apply this 
optimisation, so there the program terminates.



[jira] Commented: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801235#action_12801235
 ] 

Yonik Seeley commented on LUCENE-2216:
--

bq. This is only true if there is happens-before between the reads and the 
modifications to the object. 

Of course... I said "may be safely shared", not that any method one chooses to 
share it is correct.
It still seems that promoting hashCode and equals to mutating operations is 
wrong, no?



[jira] Commented: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801240#action_12801240
 ] 

Dawid Weiss commented on LUCENE-2216:
-

uff, I started having doubts in my own understanding, thanks for being patient 
with me.

I agree that having hashCode mutate the object's state is weird. I had some 
thoughts about it -- this particular mutation seems to be "safe" even from a 
multi-threaded point of view. If another thread sees a stale value of wlen, 
then the only thing that is going to happen is it will scan more memory; for 
ands, ors and other types of operations this will have no effect. So assuming 
hashCode/equals are the ONLY methods you're calling concurrently, it shouldn't 
break things. A similar kind of trickery goes on in String#hashCode (caching to 
a non-volatile field), although that object is immutable, so it's a slightly 
different scenario.
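
The String#hashCode style of caching mentioned above can be sketched roughly as follows. This is an illustration with a hypothetical class, not the JDK source: the data is immutable and held in a final field, the cache field is deliberately non-volatile, and the field is read exactly once into a local so a racing thread can at worst recompute the same value.

```java
final class CachedHashSketch {
    private final char[] chars;   // never modified after construction
    private int hash;             // 0 means "not computed yet"; deliberately not volatile

    CachedHashSketch(String s) {
        chars = s.toCharArray();
    }

    @Override
    public int hashCode() {
        int h = hash;             // single read of the shared field
        if (h == 0) {
            for (char c : chars) {
                h = 31 * h + c;
            }
            hash = h;             // racy but idempotent: every thread writes the same value
        }
        return h;
    }
}
```

The trick works here because the hashed state can never change; for a mutable bit set, a cached or trimmed value written by one thread can become wrong for another.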

To be honest, my preference for this would be to either maintain the wlen field 
during all operations (like java.util.BitSet) or at least to clearly state 
(JavaDoc?) that trimTrailingZeros() should be invoked prior to publishing the 
object for other threads for increased performance (in case you fiddle with 
bits and clear the tail). With the second option, your patch does a fine job of 
not mutating the object and correcting the bug.

Thanks for an interesting discussion.



[jira] Commented: (LUCENE-2212) add a test for PorterStemFilter

2010-01-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801243#action_12801243
 ] 

Simon Willnauer commented on LUCENE-2212:
-

bq. updated patch with getResource() + ZipFile 

:) thanks

bq. will commit this test at the end of the day unless anyone objects.
+1 go ahead



[jira] Commented: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801249#action_12801249
 ] 

Yonik Seeley commented on LUCENE-2216:
--

bq. I agree that having hashCode mutate the object's state is weird. I had some 
thoughts about it - this particular mutation seems to be "safe" even from 
multi-threaded point of view. If another thread sees a stale value of wlen, 
then the only thing that is going to happen is it will scan more memory;

There are still quite a few things that can go wrong I think.  If all threads 
*only* called hashCode and equals, then you *might* be right... it's very 
specific to the implementation of trimTrailingZeros()
{code}
  public void trimTrailingZeros() {
    int idx = wlen-1;
    while (idx>=0 && bits[idx]==0) idx--;
    wlen = idx+1;
  }
{code}
What could make that work is the fact that wlen is an integer, is never 
directly used as the loop counter, or as an index into the array.

But the other big question: are other read operations tolerant of wlen 
changing out from under them?  My guess would be no.
Look at xorCount for example:
{code}
if (a.wlen < b.wlen) {
  tot += BitUtil.pop_array(b.bits, a.wlen, b.wlen-a.wlen);
{code}
hashCode and equals changing wlen could cause a negative value to be passed to 
pop_array.

Yet another problem: say someone actually does want to change the set 
occasionally.  One way they could safely do this is to use a read-write lock to 
allow any number of readers to read the set as long as a writer wasn't writing 
it.  But equals and hashCode would need to be categorized under "write" methods 
for this to work... (definitely unexpected) otherwise all sorts of bad stuff 
would happen.
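
A rough sketch of the read-write-lock arrangement described above, using a simplified stand-in class (`GuardedBitsSketch` and its methods are hypothetical, not OpenBitSet). Anything that changes bits or wlen, including trimTrailingZeros(), takes the write lock; pure reads take the read lock. If hashCode or equals trimmed wlen as a side effect, they would have to be moved into the write-lock group as well, which is the unexpected part.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class GuardedBitsSketch {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final long[] bits;
    private int wlen;                       // number of words in use, guarded by lock

    GuardedBitsSketch(int numWords) {
        bits = new long[numWords];
        wlen = numWords;
    }

    void set(int index) {                   // write operation
        lock.writeLock().lock();
        try {
            int word = index >>> 6;
            bits[word] |= 1L << (index & 63);
            wlen = Math.max(wlen, word + 1);
        } finally {
            lock.writeLock().unlock();
        }
    }

    void trimTrailingZeros() {              // mutates wlen, so it is a write operation too
        lock.writeLock().lock();
        try {
            int idx = wlen - 1;
            while (idx >= 0 && bits[idx] == 0) idx--;
            wlen = idx + 1;
        } finally {
            lock.writeLock().unlock();
        }
    }

    long cardinality() {                    // read operation: wlen cannot change underneath it
        lock.readLock().lock();
        try {
            long total = 0;
            for (int i = 0; i < wlen; i++) total += Long.bitCount(bits[i]);
            return total;
        } finally {
            lock.readLock().unlock();
        }
    }

    int usedWords() {                       // read operation
        lock.readLock().lock();
        try {
            return wlen;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```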




[jira] Updated: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-01-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2186:
---

Attachment: LUCENE-2186.patch

Attaching my current state -- things are still very very rough, and there are 
contrib/remote test failures.

This patch has an initial integration with Lucene, enabling a Field to set how 
its values should be indexed into CSF... values are merged during indexing, and 
I created FieldComparators to use them for sorting.

There are still some outright hacks in there, under nocommits (eg how 
SegmentInfo.files() computes the CSF files)...

I'm now thinking we should wrap up & land flex, before going much further on 
this feature...

> First cut at column-stride fields (index values storage)
> 
>
> Key: LUCENE-2186
> URL: https://issues.apache.org/jira/browse/LUCENE-2186
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2186.patch, LUCENE-2186.patch
>
>
> I created an initial basic impl for storing "index values" (ie
> column-stride value storage).  This is still a work in progress... but
> the approach looks compelling.  I'm posting my current status/patch
> here to get feedback/iterate, etc.
> The code is standalone now, and lives under new package
> oal.index.values (plus some util changes, refactorings) -- I have yet
> to integrate into Lucene so eg you can mark that a given Field's value
> should be stored into the index values, sorting will use these values
> instead of field cache, etc.
> It handles 3 types of values:
>   * Six variants of byte[] per doc, all combinations of fixed vs
> variable length, and stored either "straight" (good for eg a
> "title" field), "deref" (good when many docs share the same value,
> but you won't do any sorting) or "sorted".
>   * Integers (variable bit precision used as necessary, ie this can
> store byte/short/int/long, and all precisions in between)
>   * Floats (4 or 8 byte precision)
> String fields are stored as the UTF8 byte[].  This patch adds a
> BytesRef, which does the same thing as flex's TermRef (we should merge
> them).
> This patch also adds basic initial impl of PackedInts (LUCENE-1990);
> we can swap that out if/when we get a better impl.
> This storage is dense (like field cache), so it's appropriate when the
> field occurs in all/most docs.  It's just like field cache, except the
> reading API is a get() method invocation, per document.
> Next step is to do basic integration with Lucene, and then compare
> sort performance of this vs field cache.
> For the "sort by String value" case, I think RAM usage & GC load of
> this index values API should be much better than field cache, since
> it does not create an object per document (instead shares big long[] and
> byte[] across all docs), and because the values are stored in RAM as
> their UTF8 bytes.
> There are abstract Writer/Reader classes.  The current reader impls
> are entirely RAM resident (like field cache), but the API is (I think)
> agnostic, ie, one could make an MMAP impl instead.
> I think this is the first baby step towards LUCENE-1231.  Ie, it
> cannot yet update values, and the reading API is fully random-access
> by docID (like field cache), not like a posting list, though I
> do think we should add an iterator() api (to return flex's DocsEnum)
> -- eg I think this would be a good way to track avg doc/field length
> for BM25/lnu.ltc scoring.
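
To make the "variable bit precision" integer storage a bit more concrete, here is a minimal sketch of packing fixed-width values into a shared long[]. It is an illustration under my own assumptions: `PackedLongsSketch` and its methods are hypothetical names and do not reflect the PackedInts API in the patch or in LUCENE-1990.

```java
final class PackedLongsSketch {
    private final long[] blocks;      // one shared array for all documents
    private final int bitsPerValue;
    private final long mask;

    PackedLongsSketch(int valueCount, int bitsPerValue) {
        if (bitsPerValue < 1 || bitsPerValue > 63) {
            throw new IllegalArgumentException("bitsPerValue must be in 1..63");
        }
        this.bitsPerValue = bitsPerValue;
        this.mask = (1L << bitsPerValue) - 1;
        long totalBits = (long) valueCount * bitsPerValue;
        this.blocks = new long[(int) ((totalBits + 63) >>> 6)];
    }

    /** Smallest width that can hold every value in 0..maxValue. */
    static int bitsRequired(long maxValue) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    void set(int index, long value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = value & mask;
        blocks[block] = (blocks[block] & ~(mask << shift)) | (v << shift);
        int spill = shift + bitsPerValue - 64;       // bits that fall into the next block
        if (spill > 0) {
            int lowBits = bitsPerValue - spill;      // bits stored in the first block
            blocks[block + 1] = (blocks[block + 1] & ~(mask >>> lowBits)) | (v >>> lowBits);
        }
    }

    long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = blocks[block] >>> shift;
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            v |= blocks[block + 1] << (bitsPerValue - spill);
        }
        return v & mask;
    }

    int blockCount() {
        return blocks.length;
    }
}
```

Reading is a plain get() per document, and 1000 values below 1024 need 10 bits each, so they fit in 157 longs instead of 1000.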



[jira] Issue Comment Edited: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801249#action_12801249
 ] 

Yonik Seeley edited comment on LUCENE-2216 at 1/16/10 6:34 PM:
---

bq. I agree that having hashCode mutate the object's state is weird. I had some 
thoughts about it - this particular mutation seems to be "safe" even from 
multi-threaded point of view. If another thread sees a stale value of wlen, 
then the only thing that is going to happen is it will scan more memory;

There are still quite a few things that can go wrong I think.  If all threads 
*only* called hashCode and equals, then you *might* be right... it's very 
specific to the implementation of trimTrailingZeros()
{code}
  public void trimTrailingZeros() {
    int idx = wlen-1;
    while (idx>=0 && bits[idx]==0) idx--;
    wlen = idx+1;
  }
{code}
What could make that work is the fact that wlen is an integer, is never 
directly used as the loop counter, or as an index into the array.

But the other big questions: are other read operations tolerant of wlen 
changing out from under them?  My guess would be no.
Look at xorCount for example:
{code}
if (a.wlen < b.wlen) {
  tot += BitUtil.pop_array(b.bits, a.wlen, b.wlen-a.wlen);
{code}
hashCode and equals changing wlen could cause a negative value to be passed to 
pop_array.

edit: deleted second example, which isn't different from the first (the issue 
is safety with other read ops).
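
The xorCount hazard can be sketched without real threads by replaying one possible interleaving. The names below (WlenRaceSketch, Bits, the Runnable hook) are hypothetical stand-ins, not OpenBitSet itself. The second method shows one common mitigation, reading each field once into a local; that is an assumption about how a reader could protect itself, not what the patch does, and it does not make the class thread-safe.

```java
// Hypothetical sketch replaying an interleaving in which another caller
// trims wlen between two reads of the same field.
final class WlenRaceSketch {
    static final class Bits {
        long[] bits;
        int wlen;

        Bits(long[] bits, int wlen) {
            this.bits = bits;
            this.wlen = wlen;
        }

        void trimTrailingZeros() {
            int idx = wlen - 1;
            while (idx >= 0 && bits[idx] == 0) idx--;
            wlen = idx + 1;
        }
    }

    // Reads b.wlen twice, as in the quoted fragment. The Runnable stands in
    // for another thread calling a trimming hashCode()/equals() in between.
    static int lengthRereadingField(Bits a, Bits b, Runnable interleaved) {
        if (a.wlen < b.wlen) {
            interleaved.run();
            return b.wlen - a.wlen; // may now be negative
        }
        return 0;
    }

    // Reads each field once, so the comparison and the subtraction agree.
    static int lengthWithSnapshot(Bits a, Bits b, Runnable interleaved) {
        int aLen = a.wlen;
        int bLen = b.wlen;
        if (aLen < bLen) {
            interleaved.run();
            return bLen - aLen;
        }
        return 0;
    }
}
```

With a.wlen == 2 and b holding four words of which only the first is non-zero, the first method computes 1 - 2 once b has been trimmed, which is the negative length Yonik describes.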

  was (Author: ysee...@gmail.com):
bq. I agree that having hashCode mutate the object's state is weird. I had 
some thoughts about it - this particular mutation seems to be "safe" even from 
multi-threaded point of view. If another thread sees a stale value of wlen, 
then the only thing that is going to happen is it will scan more memory;

There are still quite a few things that can go wrong I think.  If all threads 
*only* called hashCode and equals, then you *might* be right... it's very 
specific to the implementation of trimTrailingZeros()
{code}
  public void trimTrailingZeros() {
    int idx = wlen-1;
    while (idx>=0 && bits[idx]==0) idx--;
    wlen = idx+1;
  }
{code}
What could make that work is the fact that wlen is an integer, is never 
directly used as the loop counter, or as an index into the array.

But the other big questions: are other read operations tolerant of wlen 
changing out from under them?  My guess would be no.
Look at xorCount for example:
{code}
if (a.wlen < b.wlen) {
  tot += BitUtil.pop_array(b.bits, a.wlen, b.wlen-a.wlen);
{code}
hashCode and equals changing wlen could cause a negative value to be passed to 
pop_array.

Yet another problem: say someone actually does want to change the set 
occasionally.  One way they could safely do this is to use a read-write lock to 
allow any number of readers to read the set as long as a writer wasn't writing 
it.  But equals and hashCode would need to be categorized under "write" methods 
for this to work... (definitely unexpected) otherwise all sorts of bad stuff 
would happen.
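
A minimal sketch of that read-write lock arrangement follows. LockedBitsSketch and all of its methods are hypothetical; the point is only that a hashCode-style method which trims wlen has to take the write lock, while a genuinely read-only method such as cardinality() can share the read lock.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: a word-based bit set guarded by a read-write lock.
final class LockedBitsSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final long[] bits;
    private int wlen;

    LockedBitsSketch(int numWords) {
        bits = new long[numWords];
        wlen = numWords;
    }

    void set(int index) {
        lock.writeLock().lock();
        try {
            int word = index >>> 6;
            bits[word] |= 1L << index; // long shifts use the low 6 bits of index
            wlen = Math.max(wlen, word + 1);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Pure read: many threads may run this concurrently.
    long cardinality() {
        lock.readLock().lock();
        try {
            long count = 0;
            for (int i = 0; i < wlen; i++) count += Long.bitCount(bits[i]);
            return count;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Looks like a read, but it trims wlen, so it must be treated as a write.
    int trimmingHashCode() {
        lock.writeLock().lock();
        try {
            while (wlen > 0 && bits[wlen - 1] == 0) wlen--;
            int h = 1;
            for (int i = 0; i < wlen; i++) {
                h = 31 * h + (int) (bits[i] ^ (bits[i] >>> 32));
            }
            return h;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```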

  
> OpenBitSet#hashCode() may return false for identical sets.
> --
>
> Key: LUCENE-2216
> URL: https://issues.apache.org/jira/browse/LUCENE-2216
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Other
>Affects Versions: 2.9, 2.9.1, 3.0
>Reporter: Dawid Weiss
>Priority: Minor
> Attachments: LUCENE-2216.patch, openbitset.patch
>
>
> OpenBitSet uses an internal buffer of long variables to store set bits and an 
> additional 'wlen' index that points 
> to the highest used component inside {@link #bits} buffer.
> Unlike in JDK, the wlen field is not continuously maintained (on clearing 
> bits, for example). This leads to a situation when wlen may point
> far beyond the last set bit. 
> The hashCode implementation iterates over all long components of the bits 
> buffer, rotating the hash even for empty components. This is against the 
> contract of hashCode-equals. The following test case illustrates this:
> {code}
> // initialize two bitsets with different capacity (bits length).
> BitSet bs1 = new BitSet(200);
> BitSet bs2 = new BitSet(64);
> // set the same bit.
> bs1.set(3);
> bs2.set(3);
> 
> // equals returns true (passes).
> assertEquals(bs1, bs2);
> // hashCode returns false (against contract).
> assertEquals(bs1.hashCode(), bs2.hashCode());
> {code}
> Fix and test case attached.
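
One way to make the hash independent of unused zero words, without touching wlen at all, is sketched below. It is a hypothetical stand-alone helper over a raw long[] (the name TrailingZeroHashSketch is an assumption), and it is not claimed to be the attached patch.

```java
// Hypothetical sketch: a hash over bit-set words that ignores zero words at
// the high end of the array.
final class TrailingZeroHashSketch {
    static int hash(long[] bits) {
        // Walk from the highest word down. The running value starts at zero,
        // and XOR with zero followed by a rotate keeps it at zero, so unused
        // high words do not influence the result.
        long h = 0;
        for (int i = bits.length - 1; i >= 0; i--) {
            h ^= bits[i];
            h = (h << 1) | (h >>> 63); // rotate left by one bit
        }
        return (int) ((h >>> 32) ^ h); // fold the upper half into the lower
    }
}
```

Because nothing is written, this variant can be called from any number of reader threads, at the cost of scanning the whole array on every call.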




[jira] Assigned: (LUCENE-2217) SortedVIntList allocation should use ArrayUtils.getNextSize()

2010-01-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-2217:
--

Assignee: Michael McCandless

> SortedVIntList allocation should use ArrayUtils.getNextSize()
> -
>
> Key: LUCENE-2217
> URL: https://issues.apache.org/jira/browse/LUCENE-2217
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Reporter: Paul Elschot
>Assignee: Michael McCandless
>Priority: Trivial
> Attachments: LUCENE-2217.patch
>
>
> See recent discussion on ArrayUtils.getNextSize().




[jira] Commented: (LUCENE-2217) SortedVIntList allocation should use ArrayUtils.getNextSize()

2010-01-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801253#action_12801253
 ] 

Michael McCandless commented on LUCENE-2217:


Makes sense!




[jira] Commented: (LUCENE-2195) Speedup CharArraySet if set is empty

2010-01-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801252#action_12801252
 ] 

Simon Willnauer commented on LUCENE-2195:
-

bq. I do not think unmodifiableset should have a no-arg ctor, so instead i 
pushed this up to emptychararrayset
ok I'm fine with that.

{quote}
i do not think emptychararrayset need override and throw uoe for removeAll or 
retainAll, and i don't think the tests were correct in assuming it will throw 
uoe. it will not throw uoe for say, removeAll only because it is empty. it will 
just do nothing.
{quote}

You are right, this should only throw the exception if the set contains the 
element and the Iterator does not implement remove()
{code}
 * Note that this implementation throws an
 * UnsupportedOperationException if the iterator returned by this
 * collection's iterator method does not implement the remove
 * method and this collection contains the specified object.
{code}

same is true for AbstractSet#removeAll()  & retainAll()

Thanks for updating it. I think this is good to go though! 
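
The javadoc note above can be checked against the JDK's own AbstractSet with a small sketch. ReadOnlySetSketch and its helpers are hypothetical and unrelated to CharArraySet; they only wrap a fixed array whose iterator does not support remove().

```java
import java.util.AbstractSet;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.Set;

// Hypothetical sketch: a read-only set built on AbstractSet whose iterator
// throws UnsupportedOperationException from remove().
final class ReadOnlySetSketch {
    // Callers are expected to pass distinct items.
    static Set<String> readOnly(final String... items) {
        return new AbstractSet<String>() {
            @Override
            public Iterator<String> iterator() {
                return Collections.unmodifiableList(Arrays.asList(items)).iterator();
            }

            @Override
            public int size() {
                return items.length;
            }
        };
    }

    // Reports whether removeAll actually hit the unsupported remove().
    static boolean throwsOnRemoveAll(Set<String> set, String... toRemove) {
        try {
            set.removeAll(Arrays.asList(toRemove));
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    // Same check for retainAll.
    static boolean throwsOnRetainAll(Set<String> set, String... toKeep) {
        try {
            set.retainAll(Arrays.asList(toKeep));
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }
}
```

An empty set never reaches Iterator.remove(), so removeAll and retainAll simply return without throwing; the exception only appears once there is a matching element to remove.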



> Speedup CharArraySet if set is empty
> 
>
> Key: LUCENE-2195
> URL: https://issues.apache.org/jira/browse/LUCENE-2195
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Simon Willnauer
>Assignee: Robert Muir
> Fix For: 3.1
>
> Attachments: LUCENE-2195.patch, LUCENE-2195.patch, LUCENE-2195.patch, 
> LUCENE-2195.patch
>
>
> CharArraySet#contains(...) always computes a hash code of the String, char[] or 
> CharSequence even if the set is empty. 
> contains should return false if the set is empty.




[jira] Commented: (LUCENE-2217) SortedVIntList allocation should use ArrayUtils.getNextSize()

2010-01-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801254#action_12801254
 ] 

Michael McCandless commented on LUCENE-2217:


Actually the patch isn't quite right, I think?  First, it just calls 
ArrayUtil.getNextSize w/o passing that to resizeBytes?

Second, it needs to pass lastBytePos + MAX_BYTES_PER_INT as the arg to 
ArrayUtil.getNextSize (ie, that's the "min target size")?
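
The two points above (use the returned size, and pass lastBytePos + MAX_BYTES_PER_INT as the minimum) can be sketched as follows. This is not SortedVIntList: VIntBufferSketch, its nextSize stand-in and the use of Arrays.copyOf in place of resizeBytes are all assumptions for illustration.

```java
import java.util.Arrays;

// Hypothetical sketch: a growable buffer of variable-length encoded ints.
final class VIntBufferSketch {
    // A non-negative int needs at most 5 bytes at 7 payload bits per byte.
    static final int MAX_BYTES_PER_INT = 5;

    private byte[] bytes = new byte[8];
    private int lastBytePos = 0;

    // Stand-in for a growth policy; returns a size >= minTargetSize.
    static int nextSize(int minTargetSize) {
        return minTargetSize + Math.max(3, minTargetSize >> 3);
    }

    void add(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be non-negative");
        }
        // The minimum target is the space needed for the worst-case encoding.
        int minTarget = lastBytePos + MAX_BYTES_PER_INT;
        if (minTarget > bytes.length) {
            // The computed size is actually used for the resize.
            bytes = Arrays.copyOf(bytes, nextSize(minTarget));
        }
        while ((value & ~0x7F) != 0) {
            bytes[lastBytePos++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        bytes[lastBytePos++] = (byte) value;
    }

    int size() {
        return lastBytePos;
    }

    int capacity() {
        return bytes.length;
    }
}
```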




[jira] Commented: (LUCENE-2213) Small improvements to ArrayUtil.getNextSize

2010-01-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801257#action_12801257
 ] 

Michael McCandless commented on LUCENE-2213:


Yeah... it's not great promoting to long.  It's because of the intermediate 
computation, to do the byte alignment.  I could instead do that computation 
back in the "num elements" space... let me try that.  Then I can check for 
negative ints to catch the overflow.

Note that the args are different from above -- first arg is "min target size", 
ie, this method must return a size >= that requested size.  2nd arg is the 
element size.  Like calloc.

> Small improvements to ArrayUtil.getNextSize
> ---
>
> Key: LUCENE-2213
> URL: https://issues.apache.org/jira/browse/LUCENE-2213
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2213.patch, LUCENE-2213.patch
>
>
> Spinoff from java-dev thread "Dynamic array reallocation algorithms" started 
> on Jan 12, 2010.
> Here's what I did:
>   * Keep the +3 for small sizes
>   * Added 2nd arg = number of bytes per element.
>   * Round up to 4 or 8 byte boundary (if it's 32 or 64 bit JRE respectively)
>   * Still grow by 1/8th
>   * If 0 is passed in, return 0 back
> I also had to remove some asserts in tests that were checking the actual 
> values returned by this method -- I don't think we should test that (it's an 
> impl. detail).
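
The bullet points above could be sketched roughly as below. This is not the attached patch: GrowthSketch and nextSize are hypothetical names, the alignment is fixed at 8 bytes rather than detected from the JRE, and overflow is handled by doing the rounding in element space and checking for a negative result.

```java
// Hypothetical sketch of an over-allocation policy in the spirit of the
// description: grow by 1/8th, at least +3, rounded up to an 8-byte boundary.
final class GrowthSketch {
    static final int ALIGN_BYTES = 8; // assumption: 64-bit JRE

    // Returns a size in elements that is >= minTargetSize.
    static int nextSize(int minTargetSize, int bytesPerElement) {
        if (minTargetSize < 0) {
            throw new IllegalArgumentException("negative size: " + minTargetSize);
        }
        if (minTargetSize == 0) {
            return 0; // zero in, zero out
        }
        int extra = minTargetSize >> 3; // grow by 1/8th
        if (extra < 3) {
            extra = 3;                  // keep the +3 for small sizes
        }
        int newSize = minTargetSize + extra; // may overflow to a negative int

        int rounded = newSize;
        if (bytesPerElement == 1 || bytesPerElement == 2 || bytesPerElement == 4) {
            // Round up in element space so the byte size is a multiple of 8.
            int mask = (ALIGN_BYTES / bytesPerElement) - 1;
            rounded = (newSize + mask) & ~mask; // ~mask keeps the sign bit
        }
        if (rounded < 0) {
            throw new IllegalStateException("requested size overflows int");
        }
        return rounded;
    }
}
```

For example, nextSize(100, 4) adds 12 and is already aligned, giving 112, while nextSize(1, 1) adds 3 and rounds 4 up to 8.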




[jira] Commented: (LUCENE-2213) Small improvements to ArrayUtil.getNextSize

2010-01-16 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801261#action_12801261
 ] 

Yonik Seeley commented on LUCENE-2213:
--

Instead of masking with 7fff... you can mask with ... and let it naturally 
overflow to a negative.  No need to check the return value, it should then fail 
immediately when the value is used.




[jira] Commented: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801263#action_12801263
 ] 

Dawid Weiss commented on LUCENE-2216:
-

Chances of this happening are really slim (this would probably be a single 
inlined read as soon as the compilation takes place), but you're right in the 
general case. I am not arguing that changing the object in hashCode is good -- my 
argument is that ideally it should be fixed elsewhere (as in my previous 
suggestion -- either updating wlen every time the tail changes, or making 
explicit changes to the documentation that inform about suboptimal performance 
for zero-tailed sets).




[jira] Commented: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801265#action_12801265
 ] 

Dawid Weiss commented on LUCENE-2216:
-

For what it's worth, I checked the mentioned BitUtil methods -- ntz/pop; the 
same implementation is included from Java 1.5 upward. Do you want me to file 
another patch for this, Yonik, or are we leaving this as-is? I'd redirect from 
BitUtil to Long/Integer, deprecate BitUtil methods and replace the places in 
the code where they are used.
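
For reference, the JDK methods in question are Long.bitCount and Long.numberOfTrailingZeros, both available since Java 1.5. A redirect could look like the sketch below; JdkBitOpsSketch and its method names are hypothetical and only mirror the (array, word offset, word count) argument order seen in the pop_array call quoted earlier in this thread.

```java
// Hypothetical sketch: bit operations expressed with JDK methods only.
final class JdkBitOpsSketch {
    // Counts set bits in numWords words starting at wordOffset.
    static long popArray(long[] words, int wordOffset, int numWords) {
        long count = 0;
        for (int i = wordOffset, end = wordOffset + numWords; i < end; i++) {
            count += Long.bitCount(words[i]);
        }
        return count;
    }

    // Number of trailing zero bits; the JDK method returns 64 for zero.
    static int ntz(long word) {
        return Long.numberOfTrailingZeros(word);
    }
}
```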




[jira] Commented: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801266#action_12801266
 ] 

Yonik Seeley commented on LUCENE-2216:
--

bq. my argument is that ideally it should be fixed elsewhere

This is an expert-level class... I don't think that every call to clear() 
should be checking if it completely cleared the last word.  It's easy enough to 
call trimTrailingZeros after you did a bunch of modifications... but not so 
easy to regain the lost performance for the code doing redundant checking you 
didn't want.





[jira] Commented: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801269#action_12801269
 ] 

Dawid Weiss commented on LUCENE-2216:
-

Ok, argument accepted.




[jira] Commented: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801270#action_12801270
 ] 

Yonik Seeley commented on LUCENE-2216:
--

bq. For what it's worth, I checked the mentioned BitUtil methods - ntz/pop; the 
same implementation is included from Java 1.5 upward.

Huh - I didn't realize that Java5 had the same pop impl as I did... it will be 
cool if it finally starts using native POPCNT instructions.

As far as ntz, I went through a lot of micro-optimizations and different 
implementations before I settled on the one used in BitUtil, so it would be 
nice to do some benchmarks to see if it's truly faster now (and also what the 
performance difference is for users of JVMs before this optimization was 
implemented).




[jira] Commented: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801272#action_12801272
 ] 

Dawid Weiss commented on LUCENE-2216:
-

Ah, ok -- I thought ntz in BitUtil was the same as in Hacker's Delight, but it 
isn't. Microbenchmarks will always be misleading as they depend a lot on how 
you test, but I can do it out of sheer curiosity -- will report tomorrow.




[jira] Commented: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801275#action_12801275
 ] 

Yonik Seeley commented on LUCENE-2216:
--

bq. Microbenchmarks will always be misleading as they depend a lot on how you test, 
but I can do it out of sheer curiosity - will report tomorrow.

Cool.  I'd recommend testing in the context of OpenBitSet (i.e. don't try 
testing ntz directly).
Perhaps just create a large random set (~1M bits) with a certain percent of 
bits set, and then iterate over those set bits.
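
A rough shape for such a test is sketched below, using a raw long[] and Long.numberOfTrailingZeros rather than OpenBitSet so that it stays self-contained. SetBitIterationSketch and its methods are hypothetical, the seed and fill fraction are arbitrary parameters, and the timing helper is only a crude harness whose numbers should not be read as a real benchmark.

```java
import java.util.Random;

// Hypothetical sketch: build a random bit set and iterate over its set bits.
final class SetBitIterationSketch {
    // Creates numBits bits in which roughly fractionSet of them are set.
    static long[] randomWords(int numBits, double fractionSet, long seed) {
        long[] words = new long[(numBits + 63) >>> 6];
        Random random = new Random(seed);
        for (int i = 0; i < numBits; i++) {
            if (random.nextDouble() < fractionSet) {
                words[i >>> 6] |= 1L << i; // long shifts use the low 6 bits of i
            }
        }
        return words;
    }

    // Visits every set bit and returns the sum of their indexes, so the
    // result of each trailing-zero computation is actually used.
    static long sumOfSetBitIndexes(long[] words) {
        long sum = 0;
        for (int w = 0; w < words.length; w++) {
            long word = words[w];
            while (word != 0) {
                int bit = Long.numberOfTrailingZeros(word);
                sum += ((long) w << 6) + bit;
                word &= word - 1; // clear the lowest set bit
            }
        }
        return sum;
    }

    // Crude timing loop; returns elapsed nanoseconds for the given rounds.
    static long timeIterationNanos(long[] words, int rounds) {
        long sink = 0;
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            sink += sumOfSetBitIndexes(words);
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) {
            System.out.println(); // keeps the loop from being optimized away
        }
        return elapsed;
    }
}
```

Swapping the numberOfTrailingZeros call for another ntz implementation and comparing timeIterationNanos over several densities would give a first impression, subject to the usual microbenchmark caveats raised above.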




[jira] Commented: (LUCENE-2213) Small improvements to ArrayUtil.getNextSize

2010-01-16 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801278#action_12801278
 ] 

Dawid Weiss commented on LUCENE-2213:
-

What Yonik suggested is yet another alternative, just return negative size and 
an exception will be thrown elsewhere.




[jira] Resolved: (LUCENE-2216) OpenBitSet#hashCode() may return false for identical sets.

2010-01-16 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley resolved LUCENE-2216.
--

   Resolution: Fixed
Fix Version/s: 3.1

Committed in trunk.  Thanks for bringing this up!

> OpenBitSet#hashCode() may return false for identical sets.
> --
>
> Key: LUCENE-2216
> URL: https://issues.apache.org/jira/browse/LUCENE-2216
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Other
>Affects Versions: 2.9, 2.9.1, 3.0
>Reporter: Dawid Weiss
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2216.patch, openbitset.patch
>
>
> OpenBitSet uses an internal buffer of long variables to store set bits and an 
> additional 'wlen' index that points 
> to the highest used component inside {@link #bits} buffer.
> Unlike in JDK, the wlen field is not continuously maintained (on clearing 
> bits, for example). This leads to a situation when wlen may point
> far beyond the last set bit. 
> The hashCode implementation iterates over all long components of the bits 
> buffer, rotating the hash even for empty components. This is against the 
> contract of hashCode-equals. The following test case illustrates this:
> {code}
> // Initialize two bitsets with different capacity (bits length).
> OpenBitSet bs1 = new OpenBitSet(200);
> OpenBitSet bs2 = new OpenBitSet(64);
> // Set the same bit.
> bs1.set(3);
> bs2.set(3);
> 
> // equals returns true (passes).
> assertEquals(bs1, bs2);
> // The hash codes differ, so this assertion fails (against the contract).
> assertEquals(bs1.hashCode(), bs2.hashCode());
> {code}
> Fix and test case attached.
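
One way to make such a hash independent of trailing empty words is sketched below. This is not the attached patch: the class name BitWordsHashSketch and the method signature are hypothetical, and the mixing step is just a simple rotate-and-xor chosen for the example. The point it illustrates is that words beyond the last non-zero one must not contribute to the result.

```java
final class BitWordsHashSketch {
    private BitWordsHashSketch() {}

    /**
     * Hashes the first wlen words of bits, ignoring trailing words that are zero,
     * so that two buffers holding the same set bits hash alike regardless of capacity.
     */
    static int hashCode(long[] bits, int wlen) {
        int used = Math.min(wlen, bits.length);
        // Skip empty words at the top of the buffer.
        while (used > 0 && bits[used - 1] == 0L) {
            used--;
        }
        long h = 0;
        for (int i = 0; i < used; i++) {
            h = Long.rotateLeft(h ^ bits[i], 1);
        }
        // Fold the upper 32 bits into the lower 32 bits.
        return (int) (h ^ (h >>> 32));
    }

    public static void main(String[] args) {
        long[] large = {1L << 3, 0L, 0L, 0L}; // bit 3 set, capacity for 256 bits
        long[] small = {1L << 3};             // bit 3 set, capacity for 64 bits
        System.out.println(hashCode(large, 4) == hashCode(small, 1));
    }
}
```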




[jira] Resolved: (LUCENE-2212) add a test for PorterStemFilter

2010-01-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-2212.
-

Resolution: Fixed

Committed revision 900031.

> add a test for PorterStemFilter
> ---
>
> Key: LUCENE-2212
> URL: https://issues.apache.org/jira/browse/LUCENE-2212
> Project: Lucene - Java
> Issue Type: Test
> Components: Analysis
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2212.patch, LUCENE-2212.patch, porterTestData.zip
>
>
> There are no tests for PorterStemFilter, yet svn history reveals some (very 
> minor) cleanups, etc.
> The only thing executing its code in tests is a test or two in SmartChinese 
> tests.
> This patch runs the StemFilter against Martin Porter's test data set for this 
> stemmer, checking for expected output.
> The zip file adds 100KB to src/test; if this is too large I can change it 
> to download the data instead.




[jira] Commented: (LUCENE-2217) SortedVIntList allocation should use ArrayUtils.getNextSize()

2010-01-16 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801295#action_12801295
 ] 

Paul Elschot commented on LUCENE-2217:
--

Indeed, the patch isn't quite right. I'll fix that and provide another patch. 
All test cases pass though, so I'll also try and
add a test case that fails when an allocation larger than the current initial 
size is needed.

The MAX_BYTE_PER_INT has disappeared into an added comment that states a 
minimum initial size.
The underlying problem is that ArrayUtils.getNextSize() does not have an 
argument for a minimum increase.
Would it make sense to add that, too? The code there has some strange constants 
 (3, 6 and 9) that could
perhaps be dropped when an extra argument for a minimum increase is added.
Looking at the comment there for the growth pattern, shouldn't the second 
number (after 0) be 3 instead of 4?
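
To make the suggestion concrete, here is a minimal sketch of a growth function that takes a minimum increase. It is hypothetical: the class and method names are invented for this example, and the formula does not reproduce the existing constants in ArrayUtils.getNextSize(); it only shows how an explicit argument could replace hard-coded small-size padding.

```java
final class MinIncreaseSketch {
    private MinIncreaseSketch() {}

    /**
     * Returns targetSize plus the larger of 1/8th of targetSize and minIncrease.
     * Overflow is not handled in this sketch.
     */
    static int nextSize(int targetSize, int minIncrease) {
        if (targetSize < 0 || minIncrease < 0) {
            throw new IllegalArgumentException("sizes must be non-negative");
        }
        return targetSize + Math.max(targetSize >> 3, minIncrease);
    }

    public static void main(String[] args) {
        // Small sizes are dominated by the minimum increase, large ones by the 1/8th term.
        for (int size : new int[] {0, 8, 16, 100}) {
            System.out.println(size + " -> " + nextSize(size, 3));
        }
    }
}
```

With such an argument, a caller like SortedVIntList could pass the largest number of bytes it may need to append at once as the minimum increase.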


> SortedVIntList allocation should use ArrayUtils.getNextSize()
> -
>
> Key: LUCENE-2217
> URL: https://issues.apache.org/jira/browse/LUCENE-2217
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Other
> Reporter: Paul Elschot
> Assignee: Michael McCandless
> Priority: Trivial
> Attachments: LUCENE-2217.patch
>
>
> See recent discussion on ArrayUtils.getNextSize().




[jira] Updated: (LUCENE-2207) CJKTokenizer generates tokens with incorrect offsets

2010-01-16 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-2207:
---

Attachment: LUCENE-2207.patch

Hi Robert, thank you for looking at this so quickly!

{quote}
ok i found the bug. the problem is incrementToken() unconditionally increments 
the offset before it starts its main loop:

line 165:

offset++;
{quote}

Indeed.

In the attached patch, I added one more offset-- line and two more test cases. 
All tests pass.
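
The effect of an unconditional increment can be shown with a toy loop. This is not the CJKTokenizer code: the class OffsetSketch and both methods are hypothetical, and they only count characters. The buggy variant bumps the offset before discovering that the input is exhausted, so its final offset is one too large; the fixed variant undoes that last increment. An off-by-one like this presumably matters most for multi-valued fields, where offsets of later values build on the end of earlier ones.

```java
final class OffsetSketch {
    private OffsetSketch() {}

    /** Increments the offset before every read attempt, including the one that hits end of input. */
    static int finalOffsetBuggy(String text) {
        int offset = 0;
        int pos = 0;
        while (true) {
            offset++;
            if (pos >= text.length()) {
                break; // end of input, but offset was already incremented
            }
            pos++;
        }
        return offset;
    }

    /** Same loop, but the increment is undone when end of input is reached. */
    static int finalOffsetFixed(String text) {
        int offset = 0;
        int pos = 0;
        while (true) {
            offset++;
            if (pos >= text.length()) {
                offset--; // compensate for the unconditional increment
                break;
            }
            pos++;
        }
        return offset;
    }

    public static void main(String[] args) {
        System.out.println(finalOffsetBuggy("abc") + " vs " + finalOffsetFixed("abc"));
    }
}
```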

> CJKTokenizer generates tokens with incorrect offsets
> 
>
> Key: LUCENE-2207
> URL: https://issues.apache.org/jira/browse/LUCENE-2207
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/analyzers
> Reporter: Koji Sekiguchi
> Attachments: LUCENE-2207.patch, LUCENE-2207.patch, LUCENE-2207.patch, 
> LUCENE-2207.patch, TestCJKOffset.java
>
>
> If I index a Japanese *multi-valued* document with CJKTokenizer and highlight 
> a term with FastVectorHighlighter, the output snippets have incorrect 
> highlighted string. I'll attach a program that reproduces the problem soon.




[jira] Issue Comment Edited: (LUCENE-2207) CJKTokenizer generates tokens with incorrect offsets

2010-01-16 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801312#action_12801312
 ] 

Koji Sekiguchi edited comment on LUCENE-2207 at 1/17/10 4:54 AM:
-

Hi Robert, thank you for looking at this so quickly!

{quote}
ok i found the bug. the problem is incrementToken() unconditionally increments 
the offset before it starts its main loop:

line 165:

offset++;
{quote}

Indeed.

In the attached patch, I added one more offset-- line and two more test cases. 
All tests pass and this patch fixes the original problem that was found in Solr 
with FastVectorHighlighter.





Hudson build is back to normal: Lucene-trunk #1065

2010-01-16 Thread Apache Hudson Server
See 






[jira] Created: (LUCENE-2218) ShingleFilter improvements

2010-01-16 Thread Steven Rowe (JIRA)
ShingleFilter improvements
--

 Key: LUCENE-2218
 URL: https://issues.apache.org/jira/browse/LUCENE-2218
 Project: Lucene - Java
 Issue Type: Improvement
 Components: contrib/analyzers
 Affects Versions: 3.0
 Reporter: Steven Rowe
 Priority: Minor


ShingleFilter should allow configuration of minimum shingle size (in addition 
to maximum shingle size), so that it's possible to (e.g.) output only trigrams 
instead of bigrams mixed with trigrams.  The token separator used in composing 
shingles should be configurable too.
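
The requested behaviour can be illustrated without any Lucene classes. The sketch below is hypothetical (ShingleSketch and its shingles method are names invented here, and it works on a plain list of strings rather than a TokenStream); it requires Java 8 or later for String.join. It emits every run of minSize to maxSize consecutive tokens, joined with a configurable separator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ShingleSketch {
    private ShingleSketch() {}

    /** Returns all shingles of minSize..maxSize consecutive tokens, joined by separator. */
    static List<String> shingles(List<String> tokens, int minSize, int maxSize, String separator) {
        if (minSize < 1 || maxSize < minSize) {
            throw new IllegalArgumentException("need 1 <= minSize <= maxSize");
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            // Emit each allowed shingle length that still fits at this position.
            for (int size = minSize; size <= maxSize && start + size <= tokens.size(); size++) {
                out.add(String.join(separator, tokens.subList(start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("please", "divide", "this", "sentence");
        // minSize == maxSize == 3 yields only trigrams, with no bigrams mixed in.
        System.out.println(shingles(tokens, 3, 3, " "));
    }
}
```

Setting minSize equal to maxSize gives the "only trigrams" case from the description, while minSize 2 and maxSize 3 gives bigrams mixed with trigrams.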




[jira] Commented: (LUCENE-2207) CJKTokenizer generates tokens with incorrect offsets

2010-01-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801336#action_12801336
 ] 

Robert Muir commented on LUCENE-2207:
-

bq. In attached patch, I added one more offset-- line and two more testcases. 
All tests pass and this patch fixes the original problem that was found in Solr 
with FastVectorHighlighter.

nice, fix looks good to me.





[jira] Updated: (LUCENE-2218) ShingleFilter improvements

2010-01-16 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2218:


Attachment: LUCENE-2218.benchmark.patch
LUCENE-2218.patch

Patch implementing new features, and a patch for a new contrib/benchmark target 
"shingle", including a new task NewShingleAnalyzerTask.

ShingleFilter is largely rewritten here in order to support the new 
configurable minimum shingle size.



> ShingleFilter improvements
> --
>
> Key: LUCENE-2218
> URL: https://issues.apache.org/jira/browse/LUCENE-2218
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 3.0
> Reporter: Steven Rowe
> Priority: Minor
> Attachments: LUCENE-2218.benchmark.patch, LUCENE-2218.patch
>
>
> ShingleFilter should allow configuration of minimum shingle size (in addition 
> to maximum shingle size), so that it's possible to (e.g.) output only 
> trigrams instead of bigrams mixed with trigrams.  The token separator used in 
> composing shingles should be configurable too.




[jira] Commented: (LUCENE-2218) ShingleFilter improvements

2010-01-16 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801341#action_12801341
 ] 

Steven Rowe commented on LUCENE-2218:
-

The rewrite included some optimizations (e.g., no longer constructing n 
StringBuilders for every position in the input stream), and the performance is 
now modestly better - below is a comparison generated using the attached 
benchmark patch:

JAVA:
java version "1.5.0_15"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_15-b04)
Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_15-b04, mixed mode)

OS:
cygwin
WinVistaService Pack 2
Service Pack 26060022202561

||Max Shingle Size||Unigrams?||Unpatched||Patched||StandardAnalyzer||Improvement||
|2|no|4.92|4.74|2.19|7.5%|
|2|yes|5.04|4.90|2.19|5.6%|
|4|no|6.21|5.82|2.19|11.2%|
|4|yes|6.41|5.97|2.19|12.1%|


