[jira] [Updated] (LUCENE-5098) Broadword bit selection
[ https://issues.apache.org/jira/browse/LUCENE-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-5098:
    Assignee: Adrien Grand

Broadword bit selection
-----------------------
Key: LUCENE-5098
URL: https://issues.apache.org/jira/browse/LUCENE-5098
Project: Lucene - Core
Issue Type: Improvement
Components: core/other
Reporter: Paul Elschot
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-5098.patch

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
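Some context for readers: "bit selection" here means finding the position of the k-th set bit in a 64-bit word, which broadword algorithms compute without a bit-by-bit loop. A naive reference version (an illustrative sketch, not the attached patch) might look like:

```java
// Naive bit selection: return the position of the (r+1)-th set bit
// (r is 0-based) in a 64-bit word, or 72 when the word has fewer set
// bits -- the out-of-range convention used in the broadword selection
// literature. Illustration only, not the patch's code.
final class NaiveSelect {

  static int select(long word, int r) {
    for (int i = 0; i < 64; i++) {
      if ((word & (1L << i)) != 0) {
        if (r == 0) {
          return i; // this is the r-th set bit
        }
        r--;
      }
    }
    return 72; // fewer than r + 1 bits are set
  }

  public static void main(String[] args) {
    // 0b10110 has set bits at positions 1, 2 and 4
    System.out.println(select(0b10110L, 0)); // prints 1
    System.out.println(select(0b10110L, 2)); // prints 4
  }
}
```

Broadword select replaces this loop with a constant number of word-level operations, which is what makes it attractive for Elias-Fano style structures.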
[jira] [Updated] (LUCENE-5100) BaseDocIdSetTestCase
[ https://issues.apache.org/jira/browse/LUCENE-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-5100:
    Attachment: LUCENE-5100.patch

Thanks for the explanation, Robert. I tried to factor out some code between TestFixedBitSet and TestOpenBitSet by adding an abstraction level on top of both FixedBitSet and OpenBitSet, but its complexity made the tests even harder to read, so I think I won't touch the prevSetBit/nextSetBit/flip/... tests and will just add the tests from {{BaseDocIdSetTestCase}}. Updated patch. The modification in EliasFanoEncoder is there so that {{maxDoc - 1}} can always be passed as an upper bound, even when the set is empty (an assertion would trip otherwise). I think it is ready?

BaseDocIdSetTestCase
--------------------
Key: LUCENE-5100
URL: https://issues.apache.org/jira/browse/LUCENE-5100
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
Attachments: LUCENE-5100.patch, LUCENE-5100.patch

As Robert said on LUCENE-5081, we would benefit from having common testing infrastructure for our DocIdSet implementations.
[jira] [Commented] (LUCENE-5098) Broadword bit selection
[ https://issues.apache.org/jira/browse/LUCENE-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13705827#comment-13705827 ]

Adrien Grand commented on LUCENE-5098:

It does.

Broadword bit selection
-----------------------
Key: LUCENE-5098
URL: https://issues.apache.org/jira/browse/LUCENE-5098
Project: Lucene - Core
Issue Type: Improvement
Components: core/other
Reporter: Paul Elschot
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-5098.patch
[jira] [Commented] (LUCENE-2750) add Kamikaze 3.0.1 into Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13705876#comment-13705876 ]

Adrien Grand commented on LUCENE-2750:

FYI I ran his benchmark, and the thing is that the version of Kamikaze he is using decompresses ints one by one instead of using routines that decompress a full block in one go. Here is the relevant part of the Kamikaze code base: https://github.com/linkedin/kamikaze/blob/master/src/main/java/com/kamikaze/pfordelta/PForDelta.java#L114 -- decompressBBitSlotsWithHardCodes is commented out in favor of decompressBBitSlots.

add Kamikaze 3.0.1 into Lucene
------------------------------
Key: LUCENE-2750
URL: https://issues.apache.org/jira/browse/LUCENE-2750
Project: Lucene - Core
Issue Type: Sub-task
Components: modules/other
Reporter: hao yan
Assignee: Adrien Grand
Original Estimate: 336h
Remaining Estimate: 336h

Kamikaze 3.0.1 is the updated version of Kamikaze 2.0.0. It can achieve significantly better performance than Kamikaze 2.0.0 in terms of both compressed size and decompression speed. The main difference between the two versions is that Kamikaze 3.0.x uses a much more efficient implementation of the PForDelta compression algorithm. My goal is to integrate the highly efficient PForDelta implementation into a Lucene codec.
[jira] [Commented] (LUCENE-5098) Broadword bit selection
[ https://issues.apache.org/jira/browse/LUCENE-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13706372#comment-13706372 ]

Adrien Grand commented on LUCENE-5098:

bq. A safe conclusion is that moving selectNaive to the test cases now would be premature.

OK.

bq. I have not actually benchmarked rank9 against Long.bitCount, but I think we should do that just to be sure that rank9 is slower, and that it can be made package-private.

I played a bit with it and rank9 was always between 15% and 20% slower than bitCount no matter what the input was (which is still impressive since bitCount is supposed to be an intrinsic). We used to have a utility method in BitUtil to compute pop counts on longs but we removed it in LUCENE-2221 in favor of Long.bitCount.

bq. How about putting the assembly version in BitUtil?

Or ToStringUtils?

bq. Should LuceneTestCase also be mentioned in the wiki at How to contribute?

We try to keep that page as concise as possible, so I added a mention of it at https://wiki.apache.org/lucene-java/DeveloperTips.

Broadword bit selection
-----------------------
Key: LUCENE-5098
URL: https://issues.apache.org/jira/browse/LUCENE-5098
Project: Lucene - Core
Issue Type: Improvement
Components: core/other
Reporter: Paul Elschot
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-5098.patch
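For reference, the kind of Long.bitCount-based rank that rank9 is being benchmarked against can be sketched as follows (an illustration, not BitUtil's actual code):

```java
// Popcount-based rank: count the set bits strictly before bit position i
// in a long[] bitset, relying on the Long.bitCount intrinsic mentioned
// above. Assumes i < 64 * bits.length. Illustration only.
final class PopcountRank {

  static long rank(long[] bits, int i) {
    final int word = i >>> 6;
    long count = 0;
    for (int w = 0; w < word; w++) {
      count += Long.bitCount(bits[w]); // one popcount per full word
    }
    // mask off bit i and everything above it in the last word;
    // Java shifts are mod 64, so (1L << i) - 1 is 0 when (i & 63) == 0
    return count + Long.bitCount(bits[word] & ((1L << i) - 1));
  }
}
```

rank9 avoids the linear scan over words by precomputing block counts, which is why it only pays off on large bitsets; on a single word, a plain Long.bitCount call is hard to beat.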
[jira] [Closed] (LUCENE-5105) IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS has no effect
[ https://issues.apache.org/jira/browse/LUCENE-5105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand closed LUCENE-5105.
    Resolution: Invalid

IndexOptions only apply to the inverted index. For term vectors, please use the FieldType.setStoreTermVectors* methods.

IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS has no effect
-------------------------------------------------------------------
Key: LUCENE-5105
URL: https://issues.apache.org/jira/browse/LUCENE-5105
Project: Lucene - Core
Issue Type: Bug
Environment: In Lucene 4.2
Reporter: milesli

In Lucene 4.2, setting indexOptions to DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS has no effect on term vectors: positions and offsets are not stored with the term vector. I have to set StoreTermVectorOffsets to true and StoreTermVectorPositions to true for it to work.
[jira] [Resolved] (LUCENE-5100) BaseDocIdSetTestCase
[ https://issues.apache.org/jira/browse/LUCENE-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-5100.
    Resolution: Fixed
    Fix Version/s: 4.5

BaseDocIdSetTestCase
--------------------
Key: LUCENE-5100
URL: https://issues.apache.org/jira/browse/LUCENE-5100
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
Fix For: 4.5
Attachments: LUCENE-5100.patch, LUCENE-5100.patch

As Robert said on LUCENE-5081, we would benefit from having common testing infrastructure for our DocIdSet implementations.
[jira] [Updated] (LUCENE-2750) add Kamikaze 3.0.1 into Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-2750:
    Attachment: LUCENE-2750.patch

I wrote an implementation of a PForDeltaDocIdSet based on the ones in Kamikaze and D. Lemire's JavaFastPFOR (both are licensed under the ASL 2.0). Unlike the original implementation, it uses FOR to encode exceptions (this was easier given that we already have lots of utility methods to pack integers).

add Kamikaze 3.0.1 into Lucene
------------------------------
Key: LUCENE-2750
URL: https://issues.apache.org/jira/browse/LUCENE-2750
Project: Lucene - Core
Issue Type: Sub-task
Components: modules/other
Reporter: hao yan
Assignee: Adrien Grand
Attachments: LUCENE-2750.patch
Original Estimate: 336h
Remaining Estimate: 336h

Kamikaze 3.0.1 is the updated version of Kamikaze 2.0.0. It can achieve significantly better performance than Kamikaze 2.0.0 in terms of both compressed size and decompression speed. The main difference between the two versions is that Kamikaze 3.0.x uses a much more efficient implementation of the PForDelta compression algorithm. My goal is to integrate the highly efficient PForDelta implementation into a Lucene codec.
[jira] [Commented] (LUCENE-5101) make it easier to plugin different bitset implementations to CachingWrapperFilter
[ https://issues.apache.org/jira/browse/LUCENE-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707268#comment-13707268 ]

Adrien Grand commented on LUCENE-5101:

A quick note about the alternative DocIdSet implementations we now have. I wrote a benchmark (attached) to see how they compare to FixedBitSet; you can look at the results here: http://people.apache.org/~jpountz/doc_id_sets.html Please note that EliasFanoDocIdSet is disadvantaged for advance() since it doesn't have an index yet; it will be interesting to run this benchmark again when it gets one. Maybe we could use these numbers to pick better defaults in CWF? (and only use FixedBitSet for dense sets, for example)

make it easier to plugin different bitset implementations to CachingWrapperFilter
---------------------------------------------------------------------------------
Key: LUCENE-5101
URL: https://issues.apache.org/jira/browse/LUCENE-5101
Project: Lucene - Core
Issue Type: Improvement
Reporter: Robert Muir

Currently this is possible, but it's not so friendly:

{code}
protected DocIdSet docIdSetToCache(DocIdSet docIdSet, AtomicReader reader) throws IOException {
  if (docIdSet == null) {
    // this is better than returning null, as the nonnull result can be cached
    return EMPTY_DOCIDSET;
  } else if (docIdSet.isCacheable()) {
    return docIdSet;
  } else {
    final DocIdSetIterator it = docIdSet.iterator();
    // null is allowed to be returned by iterator(),
    // in this case we wrap with the sentinel set,
    // which is cacheable.
    if (it == null) {
      return EMPTY_DOCIDSET;
    } else {
      /* INTERESTING PART */
      final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
      bits.or(it);
      return bits;
      /* END INTERESTING PART */
    }
  }
}
{code}

Is there any value to having all this other logic in the protected API? It seems like something that's not useful for a subclass...

Maybe this stuff can become final, and INTERESTING PART calls a simpler method, something like:

{code}
protected DocIdSet cacheImpl(DocIdSetIterator iterator, AtomicReader reader) {
  final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
  bits.or(iterator);
  return bits;
}
{code}
[jira] [Commented] (LUCENE-5098) Broadword bit selection
[ https://issues.apache.org/jira/browse/LUCENE-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707384#comment-13707384 ]

Adrien Grand commented on LUCENE-5098:

Committed. Thanks Paul and Dawid!

Broadword bit selection
-----------------------
Key: LUCENE-5098
URL: https://issues.apache.org/jira/browse/LUCENE-5098
Project: Lucene - Core
Issue Type: Improvement
Components: core/other
Reporter: Paul Elschot
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-5098.patch, LUCENE-5098.patch
[jira] [Updated] (LUCENE-5109) EliasFano value index
[ https://issues.apache.org/jira/browse/LUCENE-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-5109:
    Assignee: Adrien Grand

EliasFano value index
---------------------
Key: LUCENE-5109
URL: https://issues.apache.org/jira/browse/LUCENE-5109
Project: Lucene - Core
Issue Type: Improvement
Components: core/other
Reporter: Paul Elschot
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-5109.patch

Index the upper bits of the Elias-Fano sequence.
[jira] [Commented] (LUCENE-5101) make it easier to plugin different bitset implementations to CachingWrapperFilter
[ https://issues.apache.org/jira/browse/LUCENE-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13708113#comment-13708113 ]

Adrien Grand commented on LUCENE-5101:

bq. Do WAH8 and PFOR already have an index?

They do, but the index is naive: it is a plain binary search over a subset of the (docID, position) pairs contained in the set. With the first versions of these DocIdSets, I just wanted to guarantee O(log(cardinality)) advance performance.

bq. Block decoding might still be added to EliasFano, which should improve its nextDoc() performance

The main use-case I see for these sets is to be used as filters, so I think advance() performance is more important?

bq. The Elias-Fano code is not tuned yet, so I'm surprised that the Elias-Fano time for nextDoc() is less than a factor two worse than PFOR.

Well, the PFOR doc ID set is not tuned either. :-) But I agree this is a good surprise for the Elias-Fano set. I mean, even the WAH8 doc ID set should be pretty fast and is still slower than the Elias-Fano set.

bq. Another surprise is that Elias-Fano is best at advance() among the compressed sets for some cases.

That means that Long.bitCount() is doing well on the upper bits then. I'm looking forward to the index. :-)

bq. For bit densities above 1/2 there is a clear need for WAH8 and Elias-Fano to be able to encode the inverse set. Could that be done by a common wrapper?

I guess so.
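The low/high-bit split that this Elias-Fano discussion revolves around can be sketched as follows. This is a deliberately simplified encoder, not Lucene's EliasFanoEncoder: the low bits are kept one per long instead of packed, and get() does a naive linear select over the upper bits, which is exactly the part an index and Long.bitCount speed up.

```java
// Simplified Elias-Fano split of a non-decreasing sequence of
// non-negative values, to illustrate the "upper bits" that select and
// Long.bitCount operate on. Not Lucene's EliasFanoEncoder; i must be < n.
final class EliasFanoSketch {
  final int lowBits;  // number of low bits stored verbatim per value
  final long[] lows;  // low bits of each value (left unpacked for clarity)
  final long[] upper; // upper bits, unary-coded: bit ((v >>> lowBits) + i) is set

  EliasFanoSketch(long[] sorted, long upperBound) {
    final int n = sorted.length;
    // the classic choice: roughly floor(log2(upperBound / n)) low bits
    lowBits = Math.max(0,
        63 - Long.numberOfLeadingZeros(Math.max(1L, upperBound / Math.max(1, n))));
    lows = new long[n];
    final long upperLen = (upperBound >>> lowBits) + n + 1;
    upper = new long[(int) ((upperLen + 63) >>> 6)];
    for (int i = 0; i < n; i++) {
      lows[i] = sorted[i] & ((1L << lowBits) - 1);
      final long pos = (sorted[i] >>> lowBits) + i; // strictly increasing
      upper[(int) (pos >>> 6)] |= 1L << (pos & 63);
    }
  }

  // decode value i: select the i-th set bit in the upper bits (naively
  // here), subtract i to recover the high part, then glue the low bits on
  long get(int i) {
    int seen = 0;
    for (long pos = 0; ; pos++) {
      if ((upper[(int) (pos >>> 6)] & (1L << (pos & 63))) != 0) {
        if (seen == i) {
          return ((pos - i) << lowBits) | lows[i];
        }
        seen++;
      }
    }
  }
}
```

Because the positions of set upper bits grow strictly with i, advance() can skip whole upper words by popcounting them, which is why bitCount shows up so prominently in the numbers above.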
make it easier to plugin different bitset implementations to CachingWrapperFilter
---------------------------------------------------------------------------------
Key: LUCENE-5101
URL: https://issues.apache.org/jira/browse/LUCENE-5101
Project: Lucene - Core
Issue Type: Improvement
Reporter: Robert Muir
Attachments: LUCENE-5101.patch

Currently this is possible, but it's not so friendly:

{code}
protected DocIdSet docIdSetToCache(DocIdSet docIdSet, AtomicReader reader) throws IOException {
  if (docIdSet == null) {
    // this is better than returning null, as the nonnull result can be cached
    return EMPTY_DOCIDSET;
  } else if (docIdSet.isCacheable()) {
    return docIdSet;
  } else {
    final DocIdSetIterator it = docIdSet.iterator();
    // null is allowed to be returned by iterator(),
    // in this case we wrap with the sentinel set,
    // which is cacheable.
    if (it == null) {
      return EMPTY_DOCIDSET;
    } else {
      /* INTERESTING PART */
      final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
      bits.or(it);
      return bits;
      /* END INTERESTING PART */
    }
  }
}
{code}

Is there any value to having all this other logic in the protected API? It seems like something that's not useful for a subclass...

Maybe this stuff can become final, and INTERESTING PART calls a simpler method, something like:

{code}
protected DocIdSet cacheImpl(DocIdSetIterator iterator, AtomicReader reader) {
  final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
  bits.or(iterator);
  return bits;
}
{code}
[jira] [Updated] (LUCENE-5098) Broadword bit selection
[ https://issues.apache.org/jira/browse/LUCENE-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-5098:
    Fix Version/s: 4.5

Broadword bit selection
-----------------------
Key: LUCENE-5098
URL: https://issues.apache.org/jira/browse/LUCENE-5098
Project: Lucene - Core
Issue Type: Improvement
Components: core/other
Reporter: Paul Elschot
Assignee: Adrien Grand
Priority: Minor
Fix For: 4.5
Attachments: LUCENE-5098.patch, LUCENE-5098.patch
[jira] [Resolved] (LUCENE-5098) Broadword bit selection
[ https://issues.apache.org/jira/browse/LUCENE-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-5098.
    Resolution: Fixed

Broadword bit selection
-----------------------
Key: LUCENE-5098
URL: https://issues.apache.org/jira/browse/LUCENE-5098
Project: Lucene - Core
Issue Type: Improvement
Components: core/other
Reporter: Paul Elschot
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-5098.patch, LUCENE-5098.patch
[jira] [Created] (LUCENE-5111) Fix WordDelimiterFilter
Adrien Grand created LUCENE-5111:

Summary: Fix WordDelimiterFilter
Key: LUCENE-5111
URL: https://issues.apache.org/jira/browse/LUCENE-5111
Project: Lucene - Core
Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand

WordDelimiterFilter is documented as broken in TestRandomChains (LUCENE-4641). Given how widely used it is, we should try to fix it.
[jira] [Created] (LUCENE-5113) Allow for packing the pending values of our AppendingLongBuffers
Adrien Grand created LUCENE-5113:

Summary: Allow for packing the pending values of our AppendingLongBuffers
Key: LUCENE-5113
URL: https://issues.apache.org/jira/browse/LUCENE-5113
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor

When working with small arrays, the pending values might require substantial space. So we could allow for packing the pending values in order to save space, the drawback being that this operation will make the buffer read-only.
[jira] [Updated] (LUCENE-5113) Allow for packing the pending values of our AppendingLongBuffers
[ https://issues.apache.org/jira/browse/LUCENE-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-5113:
    Attachment: LUCENE-5113.patch

Here is a patch: it adds a new freeze() method that packs the pending values into the (Monotonic)AppendingLongBuffer. This freeze method is used for ordinal maps, index sorting and FieldCache.

Allow for packing the pending values of our AppendingLongBuffers
----------------------------------------------------------------
Key: LUCENE-5113
URL: https://issues.apache.org/jira/browse/LUCENE-5113
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
Attachments: LUCENE-5113.patch

When working with small arrays, the pending values might require substantial space. So we could allow for packing the pending values in order to save space, the drawback being that this operation will make the buffer read-only.
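The freeze() idea can be illustrated with a toy buffer (a sketch under the stated trade-off, not the actual (Monotonic)AppendingLongBuffer code): while mutable it keeps a plain long[] of pending values; freezing repacks them with just enough bits per value and makes the buffer read-only.

```java
// Toy freezable buffer: growable long storage that, once frozen, repacks
// its values with the minimum number of bits per value. Assumes
// non-negative values. Illustration of the patch's trade-off only.
final class FreezableBuffer {
  private long[] pending = new long[16];
  private int size = 0;
  private long[] packed;   // non-null once frozen
  private int bitsPerValue;

  void add(long value) {
    if (packed != null) {
      throw new IllegalStateException("buffer is frozen, read-only");
    }
    if (size == pending.length) {
      pending = java.util.Arrays.copyOf(pending, 2 * size);
    }
    pending[size++] = value;
  }

  void freeze() {
    long max = 0;
    for (int i = 0; i < size; i++) {
      max |= pending[i];
    }
    bitsPerValue = Math.max(1, 64 - Long.numberOfLeadingZeros(max));
    packed = new long[(size * bitsPerValue + 63) / 64 + 1]; // +1 pad for straddling reads
    for (int i = 0; i < size; i++) {
      final int bitPos = i * bitsPerValue, word = bitPos >>> 6, shift = bitPos & 63;
      packed[word] |= pending[i] << shift;
      if (shift + bitsPerValue > 64) { // value straddles two words
        packed[word + 1] |= pending[i] >>> (64 - shift);
      }
    }
    pending = null; // the pending array can now be reclaimed
  }

  long get(int i) {
    if (packed == null) {
      return pending[i]; // not frozen yet: read the pending array directly
    }
    final int bitPos = i * bitsPerValue, word = bitPos >>> 6, shift = bitPos & 63;
    long v = packed[word] >>> shift;
    if (shift + bitsPerValue > 64) {
      v |= packed[word + 1] << (64 - shift);
    }
    return v & (bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1);
  }
}
```

The drawback stated in the issue shows up as the IllegalStateException in add(): once frozen, the structure is read-only.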
[jira] [Created] (LUCENE-5115) Make WAH8DocIdSet compute its cardinality at building time and use it for cost()
Adrien Grand created LUCENE-5115:

Summary: Make WAH8DocIdSet compute its cardinality at building time and use it for cost()
Key: LUCENE-5115
URL: https://issues.apache.org/jira/browse/LUCENE-5115
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor

DocIdSetIterator.cost() accuracy can be important for the performance of some queries (e.g. ConjunctionScorer). Since WAH8DocIdSet is immutable, we could compute its cardinality at building time and use it for the cost function.
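The idea above (hypothetical sketch, not WAH8DocIdSet's actual builder) is cheap to implement for any write-once set: maintain the count while documents are added, so an exact cardinality is available for free when cost() is later asked for.

```java
// Toy builder for an immutable doc ID set that maintains its cardinality
// incrementally at building time, so an iterator's cost() could return an
// exact value instead of an estimate. Hypothetical illustration only.
final class CountingBitsBuilder {
  private final long[] words;
  private int cardinality;
  private int lastDoc = -1;

  CountingBitsBuilder(int maxDoc) {
    words = new long[(maxDoc + 63) >>> 6];
  }

  void add(int doc) {
    if (doc <= lastDoc) {
      throw new IllegalArgumentException("doc IDs must be added in increasing order");
    }
    words[doc >>> 6] |= 1L << (doc & 63);
    cardinality++; // counted once per doc, no post-build popcount pass needed
    lastDoc = doc;
  }

  /** Exact cardinality, the value cost() could return. */
  int cardinality() {
    return cardinality;
  }

  boolean get(int doc) {
    return (words[doc >>> 6] & (1L << (doc & 63))) != 0;
  }
}
```

Since builders of this kind already see every document exactly once, the extra increment costs essentially nothing compared to re-deriving the cardinality from the compressed representation afterwards.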
[jira] [Resolved] (LUCENE-5113) Allow for packing the pending values of our AppendingLongBuffers
[ https://issues.apache.org/jira/browse/LUCENE-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-5113.
    Resolution: Fixed
    Fix Version/s: 4.5

Allow for packing the pending values of our AppendingLongBuffers
----------------------------------------------------------------
Key: LUCENE-5113
URL: https://issues.apache.org/jira/browse/LUCENE-5113
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
Fix For: 4.5
Attachments: LUCENE-5113.patch

When working with small arrays, the pending values might require substantial space. So we could allow for packing the pending values in order to save space, the drawback being that this operation will make the buffer read-only.
[jira] [Commented] (LUCENE-5117) DISI.iterator() should never return null.
[ https://issues.apache.org/jira/browse/LUCENE-5117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13709890#comment-13709890 ]

Adrien Grand commented on LUCENE-5117:

+1

DISI.iterator() should never return null.
-----------------------------------------
Key: LUCENE-5117
URL: https://issues.apache.org/jira/browse/LUCENE-5117
Project: Lucene - Core
Issue Type: Bug
Reporter: Robert Muir

If you have a Filter, you have to check for null twice: Filter.getDocIdSet() can return a null DocIdSet, and then DocIdSet.iterator() can return a null iterator. There is no reason for this: I think iterator() should never return null (consistent with the terms/postings APIs).
[jira] [Updated] (LUCENE-5101) make it easier to plugin different bitset implementations to CachingWrapperFilter
[ https://issues.apache.org/jira/browse/LUCENE-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-5101:
    Attachment: DocIdSetBenchmark.java

Well spotted. Maybe I made a mistake when moving the data from the benchmark output to the charts. I modified the program so that it directly outputs the input of the charts. See the updated charts at http://people.apache.org/~jpountz/doc_id_sets.html. I also modified it so that memory uses a log scale too.

make it easier to plugin different bitset implementations to CachingWrapperFilter
---------------------------------------------------------------------------------
Key: LUCENE-5101
URL: https://issues.apache.org/jira/browse/LUCENE-5101
Project: Lucene - Core
Issue Type: Improvement
Reporter: Robert Muir
Attachments: DocIdSetBenchmark.java, LUCENE-5101.patch

Currently this is possible, but it's not so friendly:

{code}
protected DocIdSet docIdSetToCache(DocIdSet docIdSet, AtomicReader reader) throws IOException {
  if (docIdSet == null) {
    // this is better than returning null, as the nonnull result can be cached
    return EMPTY_DOCIDSET;
  } else if (docIdSet.isCacheable()) {
    return docIdSet;
  } else {
    final DocIdSetIterator it = docIdSet.iterator();
    // null is allowed to be returned by iterator(),
    // in this case we wrap with the sentinel set,
    // which is cacheable.
    if (it == null) {
      return EMPTY_DOCIDSET;
    } else {
      /* INTERESTING PART */
      final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
      bits.or(it);
      return bits;
      /* END INTERESTING PART */
    }
  }
}
{code}

Is there any value to having all this other logic in the protected API? It seems like something that's not useful for a subclass...

Maybe this stuff can become final, and INTERESTING PART calls a simpler method, something like:

{code}
protected DocIdSet cacheImpl(DocIdSetIterator iterator, AtomicReader reader) {
  final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
  bits.or(iterator);
  return bits;
}
{code}
[jira] [Updated] (LUCENE-2750) add Kamikaze 3.0.1 into Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-2750:
    Attachment: LUCENE-2750.patch

Updated patch: DISI.cost() now returns the cardinality of the set, computed at building time.

add Kamikaze 3.0.1 into Lucene
------------------------------
Key: LUCENE-2750
URL: https://issues.apache.org/jira/browse/LUCENE-2750
Project: Lucene - Core
Issue Type: Sub-task
Components: modules/other
Reporter: hao yan
Assignee: Adrien Grand
Attachments: LUCENE-2750.patch, LUCENE-2750.patch
Original Estimate: 336h
Remaining Estimate: 336h

Kamikaze 3.0.1 is the updated version of Kamikaze 2.0.0. It can achieve significantly better performance than Kamikaze 2.0.0 in terms of both compressed size and decompression speed. The main difference between the two versions is that Kamikaze 3.0.x uses a much more efficient implementation of the PForDelta compression algorithm. My goal is to integrate the highly efficient PForDelta implementation into a Lucene codec.
[jira] [Closed] (LUCENE-2949) FastVectorHighlighter FieldTermStack could likely benefit from using TermVectorMapper
[ https://issues.apache.org/jira/browse/LUCENE-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand closed LUCENE-2949.
    Resolution: Won't Fix
    Fix Version/s: (was: 4.4)

There is no TermVectorMapper anymore.

FastVectorHighlighter FieldTermStack could likely benefit from using TermVectorMapper
-------------------------------------------------------------------------------------
Key: LUCENE-2949
URL: https://issues.apache.org/jira/browse/LUCENE-2949
Project: Lucene - Core
Issue Type: Improvement
Affects Versions: 3.0.3, 4.0-ALPHA
Reporter: Grant Ingersoll
Priority: Minor
Labels: FastVectorHighlighter, Highlighter
Attachments: LUCENE-2949.patch

Based on my reading of the FieldTermStack constructor that loads the vector from disk, we could probably save a bunch of time and memory by using the TermVectorMapper callback mechanism instead of materializing the full array of terms into memory and then throwing most of them out.
[jira] [Updated] (LUCENE-4734) FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight
[ https://issues.apache.org/jira/browse/LUCENE-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4734:
    Attachment: LUCENE-4734.patch

Ryan, I iterated on your patch in order to be able to handle a few more queries, specifically phrase queries that contain gaps or have several terms at the same position. It is very hard to handle all possibilities without making the highlighting complexity explode. I'm looking forward to LUCENE-2878 so that highlighting can be more efficient and no longer needs to duplicate the query interpretation logic.

FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight
--------------------------------------------------------------------
Key: LUCENE-4734
URL: https://issues.apache.org/jira/browse/LUCENE-4734
Project: Lucene - Core
Issue Type: Bug
Components: modules/highlighter
Affects Versions: 4.0, 4.1, 5.0
Reporter: Ryan Lauck
Labels: fastvectorhighlighter, highlighter
Fix For: 4.4
Attachments: lucene-4734.patch, LUCENE-4734.patch

If a proximity phrase query overlaps with any other query term it will not be highlighted.

Example text: A B C D E F G

Example queries:
"B E"~10 D (D will be highlighted instead of B C D E)
"B E"~10 "C F"~10 (nothing will be highlighted)

This can be traced to the FieldPhraseList constructor's inner while loop. From the first example query, the first TermInfo popped off the stack will be B. The second TermInfo will be D, which will not be found in the submap for "B E"~10 and will trigger a failed match.
[jira] [Closed] (LUCENE-4118) FastVectorHighlighter fail to highlight taking in input some proximity query.
[ https://issues.apache.org/jira/browse/LUCENE-4118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand closed LUCENE-4118. Resolution: Duplicate Duplicate of LUCENE-4734 FastVectorHighlighter fail to highlight taking in input some proximity query. - Key: LUCENE-4118 URL: https://issues.apache.org/jira/browse/LUCENE-4118 Project: Lucene - Core Issue Type: Bug Components: modules/highlighter Affects Versions: 3.4, 5.0 Reporter: Emanuele Lombardi Assignee: Koji Sekiguchi Labels: FastVectorHighlighter Attachments: FVHPatch.txt There are 2 related bugs with proximity queries: 1) If a phrase contains n repeated terms, the FVH module fails to highlight it (see testRepeatedTermsWithSlop). 2) If you search the terms reversed, the FVH module fails to highlight them (see testReversedTermsWithSlop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4542) Make RECURSION_CAP in HunspellStemmer configurable
[ https://issues.apache.org/jira/browse/LUCENE-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4542: - Assignee: Adrien Grand (was: Chris Male) Make RECURSION_CAP in HunspellStemmer configurable -- Key: LUCENE-4542 URL: https://issues.apache.org/jira/browse/LUCENE-4542 Project: Lucene - Core Issue Type: Improvement Components: modules/analysis Affects Versions: 4.0 Reporter: Piotr Assignee: Adrien Grand Attachments: Lucene-4542-javadoc.patch, LUCENE-4542.patch, LUCENE-4542-with-solr.patch Currently there is private static final int RECURSION_CAP = 2; in the code of the HunspellStemmer class. It makes using hunspell with several dictionaries almost unusable due to bad performance (e.g. it costs 36 ms to stem a long sentence in Latvian for recursion_cap=2 and 5 ms for recursion_cap=1). It would be nice to be able to tune this number as needed. AFAIK this number (2) was chosen arbitrarily. (This is the first issue I have ever filed, so please forgive any mistakes.) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4542) Make RECURSION_CAP in HunspellStemmer configurable
[ https://issues.apache.org/jira/browse/LUCENE-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4542. -- Resolution: Fixed Fix Version/s: 4.5 Committed, thanks! Make RECURSION_CAP in HunspellStemmer configurable -- Key: LUCENE-4542 URL: https://issues.apache.org/jira/browse/LUCENE-4542 Project: Lucene - Core Issue Type: Improvement Components: modules/analysis Affects Versions: 4.0 Reporter: Piotr Assignee: Adrien Grand Fix For: 4.5 Attachments: Lucene-4542-javadoc.patch, LUCENE-4542.patch, LUCENE-4542-with-solr.patch Currently there is private static final int RECURSION_CAP = 2; in the code of the HunspellStemmer class. It makes using hunspell with several dictionaries almost unusable due to bad performance (e.g. it costs 36 ms to stem a long sentence in Latvian for recursion_cap=2 and 5 ms for recursion_cap=1). It would be nice to be able to tune this number as needed. AFAIK this number (2) was chosen arbitrarily. (This is the first issue I have ever filed, so please forgive any mistakes.) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
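The change discussed in this issue boils down to a common pattern: turning a hard-coded private static final constant into a constructor parameter whose default preserves the old behavior. A minimal sketch of that pattern (class and method names here are illustrative, not Lucene's actual HunspellStemmer API):

```java
// Sketch: the hard-coded cap becomes a constructor argument, with the
// historical value kept as the default so existing callers are unchanged.
public class ConfigurableStemmer {
    // Historical hard-coded value, kept as the default.
    public static final int DEFAULT_RECURSION_CAP = 2;

    private final int recursionCap;

    public ConfigurableStemmer() {
        this(DEFAULT_RECURSION_CAP);
    }

    public ConfigurableStemmer(int recursionCap) {
        if (recursionCap < 0) {
            throw new IllegalArgumentException("recursionCap must be >= 0: " + recursionCap);
        }
        this.recursionCap = recursionCap;
    }

    public int getRecursionCap() {
        return recursionCap;
    }
}
```

Users who hit the performance issue described above could then pass a lower cap (e.g. 1) explicitly, without affecting anyone relying on the default.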
[jira] [Commented] (LUCENE-5119) DiskDV SortedDocValues shouldnt hold doc-to-ord in heap memory
[ https://issues.apache.org/jira/browse/LUCENE-5119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713408#comment-13713408 ] Adrien Grand commented on LUCENE-5119: -- +1 I think it makes sense to make DiskDV deserve its name and store everything on disk. DiskDV SortedDocValues shouldnt hold doc-to-ord in heap memory -- Key: LUCENE-5119 URL: https://issues.apache.org/jira/browse/LUCENE-5119 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Attachments: LUCENE-5119.patch These are accessed sequentially when e.g. faceting, and can be a fairly large amount of data (based on # of docs and # of unique terms). I think this was done so that conceptually random access to a specific docid would be faster than eg. stored fields, but I think we should instead target the DV datastructures towards real use cases (faceting,sorting,grouping,...) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5119) DiskDV SortedDocValues shouldnt hold doc-to-ord in heap memory
[ https://issues.apache.org/jira/browse/LUCENE-5119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713414#comment-13713414 ] Adrien Grand commented on LUCENE-5119: -- David, I think your use-case would still work pretty well with this change. In particular, if you have enough memory to store your ordinals mapping in memory, the file-system cache will likely be able to cache the whole ordinals mapping as well (you may just need to decrease the amount of memory given to the JVM a little), so random access should remain fast? DiskDV SortedDocValues shouldnt hold doc-to-ord in heap memory -- Key: LUCENE-5119 URL: https://issues.apache.org/jira/browse/LUCENE-5119 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Attachments: LUCENE-5119.patch These are accessed sequentially when e.g. faceting, and can be a fairly large amount of data (based on # of docs and # of unique terms). I think this was done so that conceptually random access to a specific docid would be faster than e.g. stored fields, but I think we should instead target the DV datastructures towards real use cases (faceting, sorting, grouping, ...) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-5115) Make WAH8DocIdSet compute its cardinality at building time and use it for cost()
[ https://issues.apache.org/jira/browse/LUCENE-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-5115. -- Resolution: Fixed Fix Version/s: 4.5 Make WAH8DocIdSet compute its cardinality at building time and use it for cost() Key: LUCENE-5115 URL: https://issues.apache.org/jira/browse/LUCENE-5115 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.5 Attachments: LUCENE-5115.patch DocIdSetIterator.cost() accuracy can be important for the performance of some queries (e.g. ConjunctionScorer). Since WAH8DocIdSet is immutable, we could compute its cardinality at building time and use it for the cost function. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
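The idea behind LUCENE-5115 generalizes well: for an immutable set, counting the members once while building costs almost nothing, and cost() can then return an exact value in O(1) instead of an estimate. A rough sketch of that structure using java.util.BitSet (this is not the real WAH8DocIdSet, whose encoding is word-aligned hybrid compression):

```java
import java.util.BitSet;

// Sketch: an immutable doc-id set whose cardinality is computed once at
// build time, so cost() is exact and requires no per-call scan.
public class ImmutableDocIdSet {
    private final BitSet bits;
    private final long cardinality;

    private ImmutableDocIdSet(BitSet bits) {
        this.bits = bits;
        this.cardinality = bits.cardinality(); // paid once, at build time
    }

    // Exact cost, useful e.g. to order clauses in a conjunction.
    public long cost() {
        return cardinality;
    }

    public boolean contains(int docId) {
        return bits.get(docId);
    }

    public static class Builder {
        private final BitSet bits = new BitSet();

        public Builder add(int docId) {
            bits.set(docId);
            return this;
        }

        public ImmutableDocIdSet build() {
            return new ImmutableDocIdSet(bits);
        }
    }
}
```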
[jira] [Updated] (LUCENE-5057) Hunspell stemmer generates multiple tokens
[ https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-5057: - Assignee: Adrien Grand Hunspell stemmer generates multiple tokens -- Key: LUCENE-5057 URL: https://issues.apache.org/jira/browse/LUCENE-5057 Project: Lucene - Core Issue Type: Improvement Affects Versions: 4.3 Reporter: Luca Cavanna Assignee: Adrien Grand The hunspell stemmer seems to be generating multiple tokens: the original token plus the available stems. It might be a good thing in some cases but it seems to be a different behaviour compared to the other stemmers and causes problems as well. I would rather have an option to decide whether it should output only the available stems, or the stems plus the original token. I'm not sure though if it's possible to have only a single stem indexed, which would be even better in my opinion. When I look at how snowball works, only one token is indexed, the stem, and that works great. Probably there's something I'm missing in how hunspell works. Here is my issue: I have a query composed of multiple terms, which is analyzed using stemming and a boolean query is generated out of it. All fine when adding all clauses as should (OR operator), but if I add all clauses as must (AND operator), then I can get back only the documents that contain the stem originated by exactly the same original word. Example for the Dutch language I'm working with: fiets (which means bicycle in Dutch); its plural is fietsen. If I index fietsen I get both fietsen and fiets indexed, but if I index fiets I only get fiets indexed. When I query for fietsen whatever, I get the following boolean query: field:fiets field:fietsen field:whatever. If I apply the AND operator and use must clauses for each subquery, then I can only find the documents that originally contained fietsen, not the ones that originally contained fiets, which is not really what stemming is about. Any thoughts on this? 
I also wonder if it can be a dictionary issue since I see that different words that have the word fiets as root don't get the same stems, and using the AND operator at query time is a big issue. I would love to contribute on this and I'm looking forward to your feedback. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4734) FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight
[ https://issues.apache.org/jira/browse/LUCENE-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713480#comment-13713480 ] Adrien Grand commented on LUCENE-4734: -- Hey Ryan, I think the use-case you are describing will be possible. However this will require some care because offsets computed by Lucene's analysis API are offsets for UTF16-encoded content (Java's internal encoding). So if your client code's programming language has a different internal encoding, you will need to perform conversions (this is not a fundamental problem, just something to be aware of to avoid bad surprises). FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight Key: LUCENE-4734 URL: https://issues.apache.org/jira/browse/LUCENE-4734 Project: Lucene - Core Issue Type: Bug Components: modules/highlighter Affects Versions: 4.0, 4.1, 5.0 Reporter: Ryan Lauck Labels: fastvectorhighlighter, highlighter Fix For: 4.4 Attachments: lucene-4734.patch, LUCENE-4734.patch If a proximity phrase query overlaps with any other query term it will not be highlighted. Example Text: A B C D E F G Example Queries: B E~10 D (D will be highlighted instead of B C D E) B E~10 C F~10 (nothing will be highlighted) This can be traced to the FieldPhraseList constructor's inner while loop. From the first example query, the first TermInfo popped off the stack will be B. The second TermInfo will be D which will not be found in the submap for B E~10 and will trigger a failed match. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-5057) Hunspell stemmer generates multiple tokens
[ https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-5057. -- Resolution: Won't Fix I checked with Luca and this is a dictionary issue: fietsen and fiets are both considered stems of fietsen with the Dutch dictionary. For people who have stemming issues, it is very easy to check whether the issue is in Lucene or in the dictionary: install hunspell-tools (apt-get install hunspell-tools on Debian and related distributions) and run:
{noformat}
% echo fietsen > tmp
% /usr/lib/hunspell/analyze nl_NL.aff nl_NL.dic tmp
fietsen
analyze(fietsen) = st:fietsen
analyze(fietsen) = st:fiets fl:N
stem(fietsen) = fietsen
stem(fietsen) = fiets
{noformat}
In this particular case, we can see that fietsen is both a stem (1st line) and a variation of fiets with the affix identified by N. Hunspell stemmer generates multiple tokens -- Key: LUCENE-5057 URL: https://issues.apache.org/jira/browse/LUCENE-5057 Project: Lucene - Core Issue Type: Improvement Affects Versions: 4.3 Reporter: Luca Cavanna Assignee: Adrien Grand The hunspell stemmer seems to be generating multiple tokens: the original token plus the available stems. It might be a good thing in some cases but it seems to be a different behaviour compared to the other stemmers and causes problems as well. I would rather have an option to decide whether it should output only the available stems, or the stems plus the original token. I'm not sure though if it's possible to have only a single stem indexed, which would be even better in my opinion. When I look at how snowball works, only one token is indexed, the stem, and that works great. Probably there's something I'm missing in how hunspell works. Here is my issue: I have a query composed of multiple terms, which is analyzed using stemming and a boolean query is generated out of it. 
All fine when adding all clauses as should (OR operator), but if I add all clauses as must (AND operator), then I can get back only the documents that contain the stem originated by exactly the same original word. Example for the Dutch language I'm working with: fiets (which means bicycle in Dutch); its plural is fietsen. If I index fietsen I get both fietsen and fiets indexed, but if I index fiets I only get fiets indexed. When I query for fietsen whatever, I get the following boolean query: field:fiets field:fietsen field:whatever. If I apply the AND operator and use must clauses for each subquery, then I can only find the documents that originally contained fietsen, not the ones that originally contained fiets, which is not really what stemming is about. Any thoughts on this? I also wonder if it can be a dictionary issue since I see that different words that have the word fiets as root don't get the same stems, and using the AND operator at query time is a big issue. I would love to contribute on this and I'm looking forward to your feedback. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4734) FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight
[ https://issues.apache.org/jira/browse/LUCENE-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4734: - Assignee: Adrien Grand FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight Key: LUCENE-4734 URL: https://issues.apache.org/jira/browse/LUCENE-4734 Project: Lucene - Core Issue Type: Bug Components: modules/highlighter Affects Versions: 4.0, 4.1, 5.0 Reporter: Ryan Lauck Assignee: Adrien Grand Labels: fastvectorhighlighter, highlighter Fix For: 4.4 Attachments: lucene-4734.patch, LUCENE-4734.patch If a proximity phrase query overlaps with any other query term it will not be highlighted. Example Text: A B C D E F G Example Queries: B E~10 D (D will be highlighted instead of B C D E) B E~10 C F~10 (nothing will be highlighted) This can be traced to the FieldPhraseList constructor's inner while loop. From the first example query, the first TermInfo popped off the stack will be B. The second TermInfo will be D which will not be found in the submap for B E~10 and will trigger a failed match. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4734) FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight
[ https://issues.apache.org/jira/browse/LUCENE-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4734. -- Resolution: Fixed FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight Key: LUCENE-4734 URL: https://issues.apache.org/jira/browse/LUCENE-4734 Project: Lucene - Core Issue Type: Bug Components: modules/highlighter Affects Versions: 4.0, 4.1, 5.0 Reporter: Ryan Lauck Assignee: Adrien Grand Labels: fastvectorhighlighter, highlighter Fix For: 4.4 Attachments: lucene-4734.patch, LUCENE-4734.patch If a proximity phrase query overlaps with any other query term it will not be highlighted. Example Text: A B C D E F G Example Queries: B E~10 D (D will be highlighted instead of B C D E) B E~10 C F~10 (nothing will be highlighted) This can be traced to the FieldPhraseList constructor's inner while loop. From the first example query, the first TermInfo popped off the stack will be B. The second TermInfo will be D which will not be found in the submap for B E~10 and will trigger a failed match. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5122) DiskDV probably shouldnt use BlockPackedReader for SortedDV doc-to-ord
[ https://issues.apache.org/jira/browse/LUCENE-5122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713670#comment-13713670 ] Adrien Grand commented on LUCENE-5122: -- For SortingMP, we only provide the ability to sort by a NumericDocValues field out of the box because numbers feel more natural for defining a static rank. Maybe another case where BlockPackedReader could help is if almost all documents have the same value. In that case BlockPackedReader will be able to require 0 bits per value for all blocks that contain a single unique value. But I agree PackedInts would likely be better in general and remove one level of indirection. DiskDV probably shouldnt use BlockPackedReader for SortedDV doc-to-ord -- Key: LUCENE-5122 URL: https://issues.apache.org/jira/browse/LUCENE-5122 Project: Lucene - Core Issue Type: Improvement Reporter: Robert Muir I don't think blocking provides any benefit here in general. we can assume the ordinals are essentially random and since SortedDV is single-valued, it's probably better to just use the simpler packedints directly? I guess the only case where it would help is if you sorted your segments by that DV field. But that seems kinda weird/esoteric to sort your index by a deref'ed string value, e.g. I don't think it's even supported by SortingMP. For the SortedSet ord stream, this can exceed 2B values so for now I think it should stay as blockpackedreader. but it could use a large blocksize... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5124) fix+document+rename DiskDV to Lucene45
[ https://issues.apache.org/jira/browse/LUCENE-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713856#comment-13713856 ] Adrien Grand commented on LUCENE-5124: -- +1 fix+document+rename DiskDV to Lucene45 -- Key: LUCENE-5124 URL: https://issues.apache.org/jira/browse/LUCENE-5124 Project: Lucene - Core Issue Type: New Feature Affects Versions: 4.5 Reporter: Robert Muir The idea is that the default implementation should not hold everything in memory, we can have a Memory impl for that. I think stuff being all in heap memory is just a relic of FieldCache. In my benchmarking diskdv works well, and it's much easier to manage (keep a smaller heap, leave it to the OS, no OOMs etc from merging large FSTs, ...) If someone wants to optimize by forcing everything in memory, they can then use the usual approach (e.g. just use FileSwitchDirectory, or pick Memory for even more efficient stuff). I'll keep the issue here for a bit. If we decide to do this, I'll work up file format docs and so on. We should also fix a few things that are not great about it (LUCENE-5122) before making it the default. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Reopened] (LUCENE-4734) FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight
[ https://issues.apache.org/jira/browse/LUCENE-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand reopened LUCENE-4734: -- FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight Key: LUCENE-4734 URL: https://issues.apache.org/jira/browse/LUCENE-4734 Project: Lucene - Core Issue Type: Bug Components: modules/highlighter Affects Versions: 4.0, 4.1, 5.0 Reporter: Ryan Lauck Assignee: Adrien Grand Labels: fastvectorhighlighter, highlighter Fix For: 4.4 Attachments: LUCENE-4734-2.patch, lucene-4734.patch, LUCENE-4734.patch If a proximity phrase query overlaps with any other query term it will not be highlighted. Example Text: A B C D E F G Example Queries: B E~10 D (D will be highlighted instead of B C D E) B E~10 C F~10 (nothing will be highlighted) This can be traced to the FieldPhraseList constructor's inner while loop. From the first example query, the first TermInfo popped off the stack will be B. The second TermInfo will be D which will not be found in the submap for B E~10 and will trigger a failed match. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4734) FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight
[ https://issues.apache.org/jira/browse/LUCENE-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4734: - Attachment: LUCENE-4734-2.patch The approach I used can be memory-intensive when there are many positions that have several terms, here is a fix. FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight Key: LUCENE-4734 URL: https://issues.apache.org/jira/browse/LUCENE-4734 Project: Lucene - Core Issue Type: Bug Components: modules/highlighter Affects Versions: 4.0, 4.1, 5.0 Reporter: Ryan Lauck Assignee: Adrien Grand Labels: fastvectorhighlighter, highlighter Fix For: 4.4 Attachments: LUCENE-4734-2.patch, lucene-4734.patch, LUCENE-4734.patch If a proximity phrase query overlaps with any other query term it will not be highlighted. Example Text: A B C D E F G Example Queries: B E~10 D (D will be highlighted instead of B C D E) B E~10 C F~10 (nothing will be highlighted) This can be traced to the FieldPhraseList constructor's inner while loop. From the first example query, the first TermInfo popped off the stack will be B. The second TermInfo will be D which will not be found in the submap for B E~10 and will trigger a failed match. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4734) FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight
[ https://issues.apache.org/jira/browse/LUCENE-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715461#comment-13715461 ] Adrien Grand commented on LUCENE-4734: -- I agree this seems wasteful. Maybe we could open an issue about it? FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight Key: LUCENE-4734 URL: https://issues.apache.org/jira/browse/LUCENE-4734 Project: Lucene - Core Issue Type: Bug Components: modules/highlighter Affects Versions: 4.0, 4.1, 5.0 Reporter: Ryan Lauck Assignee: Adrien Grand Labels: fastvectorhighlighter, highlighter Fix For: 4.4 Attachments: LUCENE-4734-2.patch, lucene-4734.patch, LUCENE-4734.patch If a proximity phrase query overlaps with any other query term it will not be highlighted. Example Text: A B C D E F G Example Queries: B E~10 D (D will be highlighted instead of B C D E) B E~10 C F~10 (nothing will be highlighted) This can be traced to the FieldPhraseList constructor's inner while loop. From the first example query, the first TermInfo popped off the stack will be B. The second TermInfo will be D which will not be found in the submap for B E~10 and will trigger a failed match. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5131) CheckIndex is confusing for docvalues fields
[ https://issues.apache.org/jira/browse/LUCENE-5131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13719404#comment-13719404 ] Adrien Grand commented on LUCENE-5131: -- Definitely +1 for this patch and printing statistics about unique value counts for SORTED and SORTED_SET. CheckIndex is confusing for docvalues fields Key: LUCENE-5131 URL: https://issues.apache.org/jira/browse/LUCENE-5131 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Attachments: LUCENE-5131.patch, LUCENE-5131.patch it prints things like: {noformat} test: docvalues...OK [0 total doc count; 18 docvalues fields] {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4876) IndexWriterConfig.clone should clone the MergeScheduler
[ https://issues.apache.org/jira/browse/LUCENE-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13720476#comment-13720476 ] Adrien Grand commented on LUCENE-4876: -- bq. We keep clone() on IWC, and the rest of the objects, and tell users that it's their responsibility to call IWC.clone() before passing to IW? That's like a 1-liner change (well + clarifying the jdocs), that will make 99% of the users happy. The rest should just do new IW(dir, conf.clone()) ... that's simple enough? Even though most users probably don't reuse their IndexWriterConfig objects, doing so should be safe and I'm a little scared of what could happen if a ConcurrentMergeScheduler was mistakenly shared by two different IndexWriters for example. Maybe another option for this issue would be to replace all these objects (MergePolicy, MergeScheduler, etc.) in IndexWriterConfig by factories for these objects that accept an IndexWriter as an argument (and maybe other objects depending on the factory). This would make it clear that IndexWriter has its own instance of these objects and reusing IndexWriterConfig instances would still be safe. An interesting side-effect is that we wouldn't need these SetOnce<?> instances in DWPT, FlushPolicy, and MergePolicy anymore, and ConcurrentMergeScheduler.indexWriter could be made final. IndexWriterConfig.clone should clone the MergeScheduler --- Key: LUCENE-4876 URL: https://issues.apache.org/jira/browse/LUCENE-4876 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Fix For: 4.3 Attachments: LUCENE-4876.patch, LUCENE-4876.patch, LUCENE-4876.patch ConcurrentMergeScheduler has a List<MergeThread> member to track the running merging threads, so IndexWriterConfig.clone should clone the merge scheduler so that both IndexWriterConfig instances are independent. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
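The factory alternative floated in the comment above could look roughly like the following sketch. All names here are hypothetical (this is not Lucene's API); the point is that the config carries a stateless factory rather than a stateful scheduler, so every writer gets a private instance and reusing one config across writers is safe by construction:

```java
// Sketch of the "factories instead of stateful objects" idea: the writer
// asks the factory for its own scheduler at construction time, so the
// field can be final and no SetOnce guard is needed.
public class SchedulerFactorySketch {
    interface MergeScheduler {}

    interface MergeSchedulerFactory {
        MergeScheduler newMergeScheduler(IndexWriterLike writer);
    }

    static class ConfigLike {
        final MergeSchedulerFactory factory; // stateless, freely shareable

        ConfigLike(MergeSchedulerFactory factory) {
            this.factory = factory;
        }
    }

    static class IndexWriterLike {
        final MergeScheduler mergeScheduler; // final: one instance per writer

        IndexWriterLike(ConfigLike config) {
            this.mergeScheduler = config.factory.newMergeScheduler(this);
        }
    }

    public static void main(String[] args) {
        ConfigLike config = new ConfigLike(w -> new MergeScheduler() {});
        IndexWriterLike w1 = new IndexWriterLike(config);
        IndexWriterLike w2 = new IndexWriterLike(config); // same config, no clone()
        System.out.println(w1.mergeScheduler != w2.mergeScheduler); // distinct instances
    }
}
```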
[jira] [Commented] (LUCENE-4876) IndexWriterConfig.clone should clone the MergeScheduler
[ https://issues.apache.org/jira/browse/LUCENE-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13720526#comment-13720526 ] Adrien Grand commented on LUCENE-4876: -- bq. This is currently impossible because of SetOnce. The merge schedulers don't have a SetOnce<IndexWriter> so if a user replaces the MergePolicy and all objects that have a SetOnce in their IndexWriterConfig and forgets the merge scheduler, the problem remains. I don't really like this SetOnce<?> trick. If a variable should only be set once, it should be final and set in the constructor? bq. how cruel it is to expose clone semantics on end-users I fully agree. In this issue I tried to make clone consistently used across stateful objects held by an IndexWriterConfig object but ideally IndexWriterConfig should only carry stateless objects (in particular none of them should have an IndexWriter as a member) so that we never need to clone it or any of its members when reusing it. IndexWriterConfig.clone should clone the MergeScheduler --- Key: LUCENE-4876 URL: https://issues.apache.org/jira/browse/LUCENE-4876 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Fix For: 4.3 Attachments: LUCENE-4876.patch, LUCENE-4876.patch, LUCENE-4876.patch ConcurrentMergeScheduler has a List<MergeThread> member to track the running merging threads, so IndexWriterConfig.clone should clone the merge scheduler so that both IndexWriterConfig instances are independent. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
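For readers unfamiliar with the idiom being debated, SetOnce is essentially a write-once holder: the first set() wins and any later set() throws. Lucene's real class is org.apache.lucene.util.SetOnce; this minimal sketch (with a different name to avoid confusion) just shows the idea:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal sketch of the write-once holder discussed above. The first
// set() succeeds; any subsequent set() throws IllegalStateException.
public class WriteOnce<T> {
    private final AtomicBoolean written = new AtomicBoolean(false);
    private volatile T value;

    public void set(T v) {
        // compareAndSet makes the "only one writer wins" check atomic.
        if (!written.compareAndSet(false, true)) {
            throw new IllegalStateException("value already set");
        }
        value = v;
    }

    public T get() {
        return value; // null until set() has been called
    }
}
```

The comment's objection is that this runtime guard is a weaker substitute for a final field assigned in the constructor, which enforces the same invariant at compile time.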
[jira] [Commented] (LUCENE-4876) IndexWriterConfig.clone should clone the MergeScheduler
[ https://issues.apache.org/jira/browse/LUCENE-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13720806#comment-13720806 ] Adrien Grand commented on LUCENE-4876: -- The SetOnce<IndexWriter> on IWC addresses my main concern. Thanks Shai! IndexWriterConfig.clone should clone the MergeScheduler --- Key: LUCENE-4876 URL: https://issues.apache.org/jira/browse/LUCENE-4876 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Fix For: 4.3 Attachments: LUCENE-4876.patch, LUCENE-4876.patch, LUCENE-4876.patch, LUCENE-4876.patch, LUCENE-4876.patch ConcurrentMergeScheduler has a List<MergeThread> member to track the running merge threads, so IndexWriterConfig.clone should clone the merge scheduler so that both IndexWriterConfig instances are independent.
[jira] [Commented] (LUCENE-5127) FixedGapTermsIndex should use monotonic compression
[ https://issues.apache.org/jira/browse/LUCENE-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13720856#comment-13720856 ] Adrien Grand commented on LUCENE-5127: -- This is a very nice cleanup! In FixedGapTermsIndexWriter, I think we could improve the buffering of offsets and addresses by buffering directly into a MonotonicBlockPackedWriter over a RAMOutputStream, and then copying the raw content of the RAMOutputStream to the IndexOutput. This would avoid an extra encoding/decoding step. FixedGapTermsIndex should use monotonic compression --- Key: LUCENE-5127 URL: https://issues.apache.org/jira/browse/LUCENE-5127 Project: Lucene - Core Issue Type: Improvement Reporter: Robert Muir Attachments: LUCENE-5127.patch, LUCENE-5127.patch, LUCENE-5127.patch for the addresses in the big in-memory byte[] and disk blocks, we could save a good deal of RAM here. I think this codec just never got upgraded when we added these new packed improvements, but it might be interesting to try to use for the terms data of sorted/sortedset DV implementations. The patch works, but has nocommits and currently ignores the divisor. The annoying problem there is that we have the shared interface with get(int) for PackedInts.Mutable/Reader, but no equivalent base class for monotonic get(long)... Still, it's enough that we could benchmark/compare for now.
[jira] [Created] (LUCENE-5140) Slowdown of the span queries caused by LUCENE-4946
Adrien Grand created LUCENE-5140: Summary: Slowdown of the span queries caused by LUCENE-4946 Key: LUCENE-5140 URL: https://issues.apache.org/jira/browse/LUCENE-5140 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor [~romseygeek] noticed that span queries have been slower since LUCENE-4946 got committed. http://people.apache.org/~mikemccand/lucenebench/SpanNear.html
[jira] [Updated] (LUCENE-5140) Slowdown of the span queries caused by LUCENE-4946
[ https://issues.apache.org/jira/browse/LUCENE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-5140: - Attachment: LUCENE-5140.patch I think it is due to some overhead of our TimSorter implementation on small arrays. Here is a patch that replaces TimSorter with InPlaceMergeSorter, which should perform better on very small arrays but still has optimizations for sorted content, e.g. merging two sorted slices is a no-op if the highest element of the 1st slice is lower than the lowest element of the 2nd slice. luceneutil seems to be happy with this patch (left is trunk, right is with the patch applied):
{noformat}
LowSpanNear    143.65 (4.5%)   157.75 (3.9%)    9.8% ( 1% - 19%)
HighSpanNear     5.47 (4.4%)     6.20 (9.7%)   13.4% ( 0% - 28%)
MedSpanNear     94.27 (3.7%)   107.51 (3.7%)   14.1% ( 6% - 22%)
{noformat}
Slowdown of the span queries caused by LUCENE-4946 -- Key: LUCENE-5140 URL: https://issues.apache.org/jira/browse/LUCENE-5140 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-5140.patch [~romseygeek] noticed that span queries have been slower since LUCENE-4946 got committed. http://people.apache.org/~mikemccand/lucenebench/SpanNear.html
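The no-op shortcut described above (skipping the merge when two adjacent sorted slices are already in order) can be sketched on a plain int[]; this is a simplified illustration, not Lucene's actual InPlaceMergeSorter, which merges without a temporary copy:

```java
import java.util.Arrays;

// Sketch of the "merge is a no-op" shortcut: when merging two adjacent
// sorted slices [from, mid) and [mid, to), if the last element of the
// first slice is <= the first element of the second, the whole range is
// already sorted and no work is needed.
class SortedSliceMerge {
    static void merge(int[] a, int from, int mid, int to) {
        if (from == mid || mid == to || a[mid - 1] <= a[mid]) {
            return; // slices already in order: nothing to do
        }
        // Fallback: naive merge via a temporary copy (a copy keeps the
        // sketch short; the real sorter works in place).
        int[] tmp = Arrays.copyOfRange(a, from, to);
        int i = 0, j = mid - from, k = from;
        while (i < mid - from && j < to - from) {
            a[k++] = tmp[i] <= tmp[j] ? tmp[i++] : tmp[j++];
        }
        while (i < mid - from) a[k++] = tmp[i++];
        while (j < to - from) a[k++] = tmp[j++];
    }
}
```

On the small position arrays that span queries sort, this cheap comparison avoids the bookkeeping overhead that TimSorter pays before doing any useful work.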
[jira] [Created] (LUCENE-5141) CheckIndex.fixIndex doesn't need a Codec
Adrien Grand created LUCENE-5141: Summary: CheckIndex.fixIndex doesn't need a Codec Key: LUCENE-5141 URL: https://issues.apache.org/jira/browse/LUCENE-5141 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial CheckIndex.fixIndex takes a codec as an argument although it doesn't need one.
[jira] [Updated] (LUCENE-5141) CheckIndex.fixIndex doesn't need a Codec
[ https://issues.apache.org/jira/browse/LUCENE-5141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-5141: - Attachment: LUCENE-5141.patch Patch removing Codec from the arguments of CheckIndex.fixIndex. CheckIndex.fixIndex doesn't need a Codec Key: LUCENE-5141 URL: https://issues.apache.org/jira/browse/LUCENE-5141 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Attachments: LUCENE-5141.patch CheckIndex.fixIndex takes a codec as an argument although it doesn't need one.
[jira] [Resolved] (LUCENE-4734) FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight
[ https://issues.apache.org/jira/browse/LUCENE-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4734. -- Resolution: Fixed FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight Key: LUCENE-4734 URL: https://issues.apache.org/jira/browse/LUCENE-4734 Project: Lucene - Core Issue Type: Bug Components: modules/highlighter Affects Versions: 4.0, 4.1, 5.0 Reporter: Ryan Lauck Assignee: Adrien Grand Labels: fastvectorhighlighter, highlighter Fix For: 5.0, 4.5 Attachments: LUCENE-4734-2.patch, lucene-4734.patch, LUCENE-4734.patch If a proximity phrase query overlaps with any other query term it will not be highlighted. Example Text: A B C D E F G Example Queries: B E~10 D (D will be highlighted instead of B C D E) B E~10 C F~10 (nothing will be highlighted) This can be traced to the FieldPhraseList constructor's inner while loop. From the first example query, the first TermInfo popped off the stack will be B. The second TermInfo will be D which will not be found in the submap for B E~10 and will trigger a failed match.
[jira] [Resolved] (LUCENE-5141) CheckIndex.fixIndex doesn't need a Codec
[ https://issues.apache.org/jira/browse/LUCENE-5141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-5141. -- Resolution: Fixed CheckIndex.fixIndex doesn't need a Codec Key: LUCENE-5141 URL: https://issues.apache.org/jira/browse/LUCENE-5141 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Attachments: LUCENE-5141.patch CheckIndex.fixIndex takes a codec as an argument although it doesn't need one.
[jira] [Updated] (LUCENE-5141) CheckIndex.fixIndex doesn't need a Codec
[ https://issues.apache.org/jira/browse/LUCENE-5141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-5141: - Fix Version/s: 4.5 5.0 CheckIndex.fixIndex doesn't need a Codec Key: LUCENE-5141 URL: https://issues.apache.org/jira/browse/LUCENE-5141 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Fix For: 5.0, 4.5 Attachments: LUCENE-5141.patch CheckIndex.fixIndex takes a codec as an argument although it doesn't need one.
[jira] [Updated] (LUCENE-5145) Added AppendingPackedLongBuffer extended AbstractAppendingLongBuffer family (customizable compression ratio + bulk retrieval)
[ https://issues.apache.org/jira/browse/LUCENE-5145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-5145: - Assignee: Adrien Grand Added AppendingPackedLongBuffer extended AbstractAppendingLongBuffer family (customizable compression ratio + bulk retrieval) --- Key: LUCENE-5145 URL: https://issues.apache.org/jira/browse/LUCENE-5145 Project: Lucene - Core Issue Type: Improvement Reporter: Boaz Leskes Assignee: Adrien Grand Attachments: LUCENE-5145.patch Made acceptableOverheadRatio configurable. Added bulk get to the AbstractAppendingLongBuffer classes, for faster retrieval. Introduced a new variant, AppendingPackedLongBuffer, which solely relies on PackedInts as a back-end. This new class is useful where people have non-negative numbers with a fairly uniform distribution over a fixed (limited) range, e.g. facet ordinals. To distinguish it from AppendingPackedLongBuffer, the delta-based AppendingLongBuffer was renamed to AppendingDeltaPackedLongBuffer. Fixed an issue with NullReader where it didn't respect its valueCount in bulk gets.
[jira] [Created] (LUCENE-5148) SortedSetDocValues caching / state
Adrien Grand created LUCENE-5148: Summary: SortedSetDocValues caching / state Key: LUCENE-5148 URL: https://issues.apache.org/jira/browse/LUCENE-5148 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Priority: Minor I just spent some time digging into a bug which was due to the fact that SORTED_SET doc values are stateful (setDocument/nextOrd) and are cached per thread. So if you try to get two instances for the same field in the same thread, you will actually get the same instance and won't be able to iterate over the ords of two documents in parallel. This is not necessarily a bug, and the behavior can be documented, but I think it would be nice if the API could prevent such mistakes by storing the state in a separate object or cloning the SortedSetDocValues object in SegmentCoreReaders.getSortedSetDocValues. What do you think?
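The pitfall is easy to reproduce with any per-thread-cached stateful cursor. A toy model of a setDocument/nextOrd-style iterator (illustrative only, not the actual Lucene class) shows why handing the same cached instance to two consumers breaks parallel iteration:

```java
// Toy model of a stateful ords cursor in the style of
// SortedSetDocValues.setDocument/nextOrd (not the real Lucene class).
// Because the iteration state (upto) lives in the instance, two callers
// that received the same cached instance trample each other's cursor.
class StatefulOrds {
    static final long NO_MORE_ORDS = -1;
    private final long[][] ordsPerDoc;
    private long[] current;
    private int upto;

    StatefulOrds(long[][] ordsPerDoc) { this.ordsPerDoc = ordsPerDoc; }

    void setDocument(int doc) { current = ordsPerDoc[doc]; upto = 0; }

    long nextOrd() { return upto < current.length ? current[upto++] : NO_MORE_ORDS; }
}
```

If a second consumer calls setDocument on the shared instance, the first consumer's cursor is silently reset; that is exactly the bug described above, and it is what moving the state into a separate object (or cloning per caller) would prevent.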
[jira] [Commented] (LUCENE-5127) FixedGapTermsIndex should use monotonic compression
[ https://issues.apache.org/jira/browse/LUCENE-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13722649#comment-13722649 ] Adrien Grand commented on LUCENE-5127: -- +1 FixedGapTermsIndex should use monotonic compression --- Key: LUCENE-5127 URL: https://issues.apache.org/jira/browse/LUCENE-5127 Project: Lucene - Core Issue Type: Improvement Reporter: Robert Muir Attachments: LUCENE-5127.patch, LUCENE-5127.patch, LUCENE-5127.patch, LUCENE-5127.patch for the addresses in the big in-memory byte[] and disk blocks, we could save a good deal of RAM here. I think this codec just never got upgraded when we added these new packed improvements, but it might be interesting to try to use for the terms data of sorted/sortedset DV implementations. The patch works, but has nocommits and currently ignores the divisor. The annoying problem there is that we have the shared interface with get(int) for PackedInts.Mutable/Reader, but no equivalent base class for monotonic get(long)... Still, it's enough that we could benchmark/compare for now.
[jira] [Commented] (LUCENE-5145) Added AppendingPackedLongBuffer extended AbstractAppendingLongBuffer family (customizable compression ratio + bulk retrieval)
[ https://issues.apache.org/jira/browse/LUCENE-5145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13722770#comment-13722770 ] Adrien Grand commented on LUCENE-5145: -- Thanks Boaz, the patch looks very good! - I like the fact that the addition of the new bulk API helped make fillValues final! - OrdinalMap.subIndexes, SortedDocValuesWriter.pending and SortedSetDocValuesWriter.pending are 0-based so they could use the new {{AppendingPackedLongBuffer}} instead of {{AppendingDeltaPackedLongBuffer}}, can you update the patch? Added AppendingPackedLongBuffer extended AbstractAppendingLongBuffer family (customizable compression ratio + bulk retrieval) --- Key: LUCENE-5145 URL: https://issues.apache.org/jira/browse/LUCENE-5145 Project: Lucene - Core Issue Type: Improvement Reporter: Boaz Leskes Assignee: Adrien Grand Attachments: LUCENE-5145.patch Made acceptableOverheadRatio configurable. Added bulk get to the AbstractAppendingLongBuffer classes, for faster retrieval. Introduced a new variant, AppendingPackedLongBuffer, which solely relies on PackedInts as a back-end. This new class is useful where people have non-negative numbers with a fairly uniform distribution over a fixed (limited) range, e.g. facet ordinals. To distinguish it from AppendingPackedLongBuffer, the delta-based AppendingLongBuffer was renamed to AppendingDeltaPackedLongBuffer. Fixed an issue with NullReader where it didn't respect its valueCount in bulk gets.
[jira] [Created] (LUCENE-5150) WAH8DocIdSet: dense sets compression
Adrien Grand created LUCENE-5150: Summary: WAH8DocIdSet: dense sets compression Key: LUCENE-5150 URL: https://issues.apache.org/jira/browse/LUCENE-5150 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial
[jira] [Updated] (LUCENE-5150) WAH8DocIdSet: dense sets compression
[ https://issues.apache.org/jira/browse/LUCENE-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-5150: - Description: In LUCENE-5101, Paul Elschot mentioned that it would be interesting to be able to encode the inverse set to also compress very dense sets. WAH8DocIdSet: dense sets compression Key: LUCENE-5150 URL: https://issues.apache.org/jira/browse/LUCENE-5150 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial In LUCENE-5101, Paul Elschot mentioned that it would be interesting to be able to encode the inverse set to also compress very dense sets.
[jira] [Updated] (LUCENE-5150) WAH8DocIdSet: dense sets compression
[ https://issues.apache.org/jira/browse/LUCENE-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-5150: - Attachment: LUCENE-5150.patch Here is a patch. It reserves an additional bit in the header to say whether the encoding should be inverted (meaning clean words are actually 0xFF instead of 0x00). It should reduce the amount of memory required to build and store dense sets. In spite of this change, compression ratios remain the same for sparse sets. For random dense sets, I observed compression ratios of 87% when the load factor is 90% and 20% when the load factor is 99% (vs. 100% before). WAH8DocIdSet: dense sets compression Key: LUCENE-5150 URL: https://issues.apache.org/jira/browse/LUCENE-5150 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Attachments: LUCENE-5150.patch In LUCENE-5101, Paul Elschot mentioned that it would be interesting to be able to encode the inverse set to also compress very dense sets.
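The inversion trick can be illustrated independently of the actual WAH8 on-disk format: a word-aligned encoder run-length encodes "clean" 0x00 words, and for a dense set most 8-bit words are 0xFF, so XOR-ing every word with 0xFF first turns the dense case into the sparse case. A toy sketch of the header-bit decision (an assumed simplification, not the real codec):

```java
// Toy illustration of the "inverse" encoding: count the words that would
// be run-length encoded as clean (0x00) with and without inverting every
// word, and record the better choice in a single header bit.
class InverseCleanWords {
    static boolean shouldInvert(byte[] words) {
        int clean = 0, cleanInverted = 0;
        for (byte w : words) {
            if (w == 0) clean++;                 // clean as-is (sparse set)
            if (w == (byte) 0xFF) cleanInverted++; // clean after inversion (dense set)
        }
        return cleanInverted > clean;
    }
}
```

With this choice made per set, a 99%-full bitmap compresses roughly as well as a 1%-full one, which matches the ratios reported above.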
[jira] [Resolved] (LUCENE-5145) Added AppendingPackedLongBuffer extended AbstractAppendingLongBuffer family (customizable compression ratio + bulk retrieval)
[ https://issues.apache.org/jira/browse/LUCENE-5145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-5145. -- Resolution: Fixed Fix Version/s: 4.5 5.0 Committed. Thanks Boaz! Added AppendingPackedLongBuffer extended AbstractAppendingLongBuffer family (customizable compression ratio + bulk retrieval) --- Key: LUCENE-5145 URL: https://issues.apache.org/jira/browse/LUCENE-5145 Project: Lucene - Core Issue Type: Improvement Reporter: Boaz Leskes Assignee: Adrien Grand Fix For: 5.0, 4.5 Attachments: LUCENE-5145.patch, LUCENE-5145.v2.patch Made acceptableOverheadRatio configurable. Added bulk get to the AbstractAppendingLongBuffer classes, for faster retrieval. Introduced a new variant, AppendingPackedLongBuffer, which solely relies on PackedInts as a back-end. This new class is useful where people have non-negative numbers with a fairly uniform distribution over a fixed (limited) range, e.g. facet ordinals. To distinguish it from AppendingPackedLongBuffer, the delta-based AppendingLongBuffer was renamed to AppendingDeltaPackedLongBuffer. Fixed an issue with NullReader where it didn't respect its valueCount in bulk gets.
[jira] [Commented] (LUCENE-5153) Allow wrapping Reader from AnalyzerWrapper
[ https://issues.apache.org/jira/browse/LUCENE-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723992#comment-13723992 ] Adrien Grand commented on LUCENE-5153: -- I think this is the right thing to do. On the contrary, if wrapReader inserted char filters at the end of the char filter chain, the behavior of the wrapped analyzer would be altered (it would allow inserting something between the first CharFilter and the last TokenFilter of the wrapped analyzer). Allow wrapping Reader from AnalyzerWrapper -- Key: LUCENE-5153 URL: https://issues.apache.org/jira/browse/LUCENE-5153 Project: Lucene - Core Issue Type: New Feature Components: core/index Reporter: Shai Erera Assignee: Shai Erera Attachments: LUCENE-5153.patch It can be useful to allow AnalyzerWrapper extensions to wrap the Reader given to initReader, e.g. with a CharFilter.
[jira] [Commented] (LUCENE-5140) Slowdown of the span queries caused by LUCENE-4946
[ https://issues.apache.org/jira/browse/LUCENE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13724003#comment-13724003 ] Adrien Grand commented on LUCENE-5140: -- If there is no objection, I will commit the patch as-is soon and have a look at the lucenebench reports in the next few days. Slowdown of the span queries caused by LUCENE-4946 -- Key: LUCENE-5140 URL: https://issues.apache.org/jira/browse/LUCENE-5140 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-5140.patch [~romseygeek] noticed that span queries have been slower since LUCENE-4946 got committed. http://people.apache.org/~mikemccand/lucenebench/SpanNear.html
[jira] [Resolved] (LUCENE-5140) Slowdown of the span queries caused by LUCENE-4946
[ https://issues.apache.org/jira/browse/LUCENE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-5140. -- Resolution: Fixed Fix Version/s: 4.5 5.0 Committed. I will have a look at lucenebench in the next few days. Slowdown of the span queries caused by LUCENE-4946 -- Key: LUCENE-5140 URL: https://issues.apache.org/jira/browse/LUCENE-5140 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 5.0, 4.5 Attachments: LUCENE-5140.patch [~romseygeek] noticed that span queries have been slower since LUCENE-4946 got committed. http://people.apache.org/~mikemccand/lucenebench/SpanNear.html
[jira] [Created] (LUCENE-4634) PackedInts: streaming API that supports variable numbers of bits per value
Adrien Grand created LUCENE-4634: Summary: PackedInts: streaming API that supports variable numbers of bits per value Key: LUCENE-4634 URL: https://issues.apache.org/jira/browse/LUCENE-4634 Project: Lucene - Core Issue Type: Improvement Components: core/other Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor It could be convenient to have a streaming API (writers and iterators, no random access) that supports variable numbers of bits per value. Although this would be much slower than the current fixed-size APIs, it could help save bytes in our codec formats. The API could look like:
{code}
Iterator {
  long next(int bitsPerValue);
}

Writer {
  void write(long value, int bitsPerValue); // assert PackedInts.bitsRequired(value) <= bitsPerValue;
}
{code}
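A minimal sketch of such a streaming writer/iterator pair (a toy in-memory version for illustration, not the patch's implementation) packs each value with its caller-specified width; the caller must read values back with the same widths it wrote them with:

```java
import java.util.ArrayList;
import java.util.List;

// Toy bit stream illustrating the proposed API: write(value, bitsPerValue)
// appends exactly bitsPerValue bits; next(bitsPerValue) reads them back.
// A real implementation would pack into long[] blocks instead of a
// List<Boolean>, but the contract is the same.
class VarBitStream {
    private final List<Boolean> bits = new ArrayList<>();
    private int pos = 0;

    void write(long value, int bitsPerValue) {
        // Mirrors the assert in the proposed API: the value must fit.
        assert 64 - Long.numberOfLeadingZeros(value) <= bitsPerValue;
        for (int i = bitsPerValue - 1; i >= 0; i--) {
            bits.add(((value >>> i) & 1) != 0);
        }
    }

    long next(int bitsPerValue) {
        long v = 0;
        for (int i = 0; i < bitsPerValue; i++) {
            v = (v << 1) | (bits.get(pos++) ? 1 : 0);
        }
        return v;
    }
}
```

Because the width is a per-call argument rather than a stream constant, consecutive values can use different numbers of bits, which is exactly what fixed-size packed readers cannot do.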
[jira] [Commented] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535079#comment-13535079 ] Adrien Grand commented on LUCENE-4599: -- Hey Shawn, I'm still working actively on this issue. I made good progress regarding compression ratio, but term vectors are more complicated than stored fields (with lots of corner cases like negative start offsets, negative lengths, fields that don't always have the same options, etc.), so I will need time and lots of Jenkins builds to feel comfortable making this the default term vectors impl. It will depend on the 4.1 release schedule, but given that it's likely to come rather soon and that I will have very little time to work on this issue until next month, it will probably only make it to 4.2. Compressed term vectors --- Key: LUCENE-4599 URL: https://issues.apache.org/jira/browse/LUCENE-4599 Project: Lucene - Core Issue Type: Task Components: core/codecs, core/termvectors Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.1 Attachments: LUCENE-4599.patch We should have codec-compressed term vectors similarly to what we have with stored fields.
[jira] [Updated] (LUCENE-4634) PackedInts: streaming API that supports variable numbers of bits per value
[ https://issues.apache.org/jira/browse/LUCENE-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4634: - Attachment: LUCENE-4634.patch Here is a patch. (I would like to use it for LUCENE-4599.) PackedInts: streaming API that supports variable numbers of bits per value -- Key: LUCENE-4634 URL: https://issues.apache.org/jira/browse/LUCENE-4634 Project: Lucene - Core Issue Type: Improvement Components: core/other Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4634.patch It could be convenient to have a streaming API (writers and iterators, no random access) that supports variable numbers of bits per value. Although this would be much slower than the current fixed-size APIs, it could help save bytes in our codec formats. The API could look like: {code} Iterator { long next(int bitsPerValue); } Writer { void write(long value, int bitsPerValue); // assert PackedInts.bitsRequired(value) <= bitsPerValue; } {code}
[jira] [Commented] (SOLR-4215) Optimize facets when multi-valued field is really single valued
[ https://issues.apache.org/jira/browse/SOLR-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535804#comment-13535804 ] Adrien Grand commented on SOLR-4215: Hi Ryan. I think this test should be done on every segment rather than on the top-level composite reader, because if the index has several segments, {{terms(field)}} will return a MultiTerms instance whose {{size()}} method always returns -1. Optimize facets when multi-valued field is really single valued --- Key: SOLR-4215 URL: https://issues.apache.org/jira/browse/SOLR-4215 Project: Solr Issue Type: Improvement Components: SearchComponents - other Reporter: Ryan McKinley Priority: Minor Fix For: 4.1 Attachments: SOLR-4215-check-single-valued.patch In lucene 4+, the Terms interface can quickly tell us if the index is actually single-valued. We should use that for better facet performance with multi-valued fields (when they are actually single valued)
[jira] [Updated] (LUCENE-4599) Compressed term vectors
[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4599: - Attachment: LUCENE-4599.patch New patch (still not committable yet) with better compression ratio thanks to the following optimizations: * the block of data compressed by LZ4 only contains term and payload bytes (without their lengths), everything else (positions, flags, term lengths, etc.) is stored using packed ints, * term freqs are encoded in a pfor-like way to save space (this was a 3x/4x decrease of the space needed to store freqs), * when all fields have the same flags (a 3-bits int that says whether positions/offsets/payloads are enabled), the flag is stored only once per distinct field, * when both positions and offsets are enabled, I compute average term lengths and only store the difference between the start offset and the expected start offset computed from the average term length and the position, * for lengths, this impl stores the difference between the indexed term length and the actual length (endOffset - startOffset), with an optimization when they are always equal to 0 (can happen with ASCII and an analyzer that does not perform stemming). Depending on the size of docs, not the same data takes most space in a single chunk: || || Small docs (28 * 1K) || Large doc (1 * 750K) || | Total chunk size (positions and offsets enabled) | 21K | 450K | | Term bytes | 11K (16K before compression) | 64K (84K before compression) | | Term lengths | 2K | 8K | | Positions | 3K | 215K | | Offsets | 3K (4K if positions are disabled) | 150K (240K if positions are disabled) | | Term freqs | 500 | 7K | the rest is negligible * So with small docs, most of space is occupied by term bytes whereas with large docs positions and offsets can easily take 80% of the chunk size. * Compression might not be as good as with stored fields, especially when docs are large because terms have already been deduplicated. 
Overall, the on-disk format is more compact than the Lucene40 term vectors format (positions and offsets enabled; note that the number of documents indexed is not the same for small and large docs):
|| || Small docs || Large docs ||
| Lucene40 tvx | 160033 | 1633 |
| Lucene40 tvd | 49971 | 232 |
| Lucene40 tvf | 11279483 | 56640734 |
| Compressing tvx | 1116 | 78 |
| Compressing tvd | 7589550 | 44633841 |
This impl is 34% smaller than the Lucene40 one on small docs (mainly thanks to compression) and 21% smaller on large docs (mainly thanks to packed ints). If you have other ideas to improve this ratio, let me know! I still have to write more tests, clean up the patch, make reading term vectors more memory-efficient, and implement efficient merging... Compressed term vectors --- Key: LUCENE-4599 URL: https://issues.apache.org/jira/browse/LUCENE-4599 Project: Lucene - Core Issue Type: Task Components: core/codecs, core/termvectors Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.1 Attachments: LUCENE-4599.patch, LUCENE-4599.patch We should have codec-compressed term vectors similarly to what we have with stored fields. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets
[ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13536928#comment-13536928 ] Adrien Grand commented on LUCENE-4609: -- bq. Attached a PackedEncoder, which is based on PackedInts. Nice! You could probably improve the memory efficiency and speed of the decoder by using a ReaderIterator instead of a Reader:
* getReader: consumes the packed array stream and returns an in-memory packed array,
* getDirectReader: does not consume the whole stream and returns an impl that uses IndexInput.seek to look up values,
* getReaderIterator: returns a sequential iterator which bulk-decodes values (the mem parameter allows you to control the speed/memory-efficiency trade-off), so it will be much faster than iterating over the values of getReader.
For improved speed, getReaderIterator has the {{next(int count)}} method, which returns several values in a single call; this proved to be faster. Another option could be to directly use PackedInts.Encoder/Decoder similarly to Lucene41PostingsFormat (packed writers and reader iterators also use them under the hood). bq. This is PForDelta compression (the outliers are encoded separately) I think? We can test it and see if it helps ... but we weren't so happy with it for encoding postings If the packed stream is very large, another option is to split it into blocks that all have the same number of values (but different numbers of bits per value). This should prevent the whole stream from growing because of rare extreme values. This is what the stored fields index (with blocks of 1024 values) and Lucene41PostingsFormat (with blocks of 128 values) do. Storing the min value at the beginning of the block and then only encoding deltas could help too. bq. The header is very large ... really you should only need 1) bpv, and 2) bytes.length (which I think you already have, via both payloads and DocValues). 
If the PackedInts API isn't flexible enough for you to feed it bpv and bytes.length then let's fix that! Most PackedInts methods have a *NoHeader variant that does the exact same job without relying on a header at the beginning of the stream (LUCENE-4161); I think this is what you are looking for. We should probably make this header stuff opt-in rather than opt-out (by replacing getWriter/Reader/ReaderIterator with the NoHeader methods and adding a method dedicated to reading/writing a header). Write a PackedIntsEncoder/Decoder for facets Key: LUCENE-4609 URL: https://issues.apache.org/jira/browse/LUCENE-4609 Project: Lucene - Core Issue Type: New Feature Components: modules/facet Reporter: Shai Erera Priority: Minor Attachments: LUCENE-4609.patch Today the facets API lets you write IntEncoder/Decoder to encode/decode the category ordinals. We have several such encoders, including VInt (default), and block encoders. It would be interesting to implement and benchmark a PackedIntsEncoder/Decoder, with potentially two variants: (1) receives bitsPerValue up front, when you e.g. know that you have a small taxonomy and the max value you can see and (2) one that decides for each doc on the optimal bitsPerValue, writes it as a header in the byte[] or something. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4634) PackedInts: streaming API that supports variable numbers of bits per value
[ https://issues.apache.org/jira/browse/LUCENE-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4634. -- Resolution: Fixed PackedInts: streaming API that supports variable numbers of bits per value -- Key: LUCENE-4634 URL: https://issues.apache.org/jira/browse/LUCENE-4634 Project: Lucene - Core Issue Type: Improvement Components: core/other Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4634.patch It could be convenient to have a streaming API (writers and iterators, no random access) that supports variable numbers of bits per value. Although this would be much slower than the current fixed-size APIs, it could help save bytes in our codec formats. The API could look like:
{code}
Iterator {
  long next(int bitsPerValue);
}

Writer {
  void write(long value, int bitsPerValue); // assert PackedInts.bitsRequired(value) <= bitsPerValue;
}
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
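The API sketched in that issue can be prototyped with plain bit twiddling. The following is a minimal, self-contained sketch under the same contract (`bitsRequired(value) <= bitsPerValue`); the class and field names are made up and this is not Lucene's implementation, which writes to IndexOutput rather than an in-memory array.

```java
import java.util.Arrays;

// Minimal sketch of a streaming packed-ints writer/iterator where each
// value may use a different, caller-supplied number of bits. Values are
// written most-significant bit first into a growable long[].
public class VariableBitStream {

    private long[] blocks = new long[1];
    private long writePos = 0; // next bit to write
    private long readPos = 0;  // next bit to read

    public void write(long value, int bitsPerValue) {
        // same contract as the proposed API: bitsRequired(value) <= bitsPerValue
        assert 64 - Long.numberOfLeadingZeros(value) <= bitsPerValue;
        for (int i = bitsPerValue - 1; i >= 0; i--) {
            int block = (int) (writePos >>> 6), bit = 63 - (int) (writePos & 63);
            if (block >= blocks.length) blocks = Arrays.copyOf(blocks, blocks.length * 2);
            blocks[block] |= ((value >>> i) & 1L) << bit;
            writePos++;
        }
    }

    public long next(int bitsPerValue) {
        long value = 0;
        for (int i = 0; i < bitsPerValue; i++) {
            int block = (int) (readPos >>> 6), bit = 63 - (int) (readPos & 63);
            value = (value << 1) | ((blocks[block] >>> bit) & 1L);
            readPos++;
        }
        return value;
    }

    public static void main(String[] args) {
        VariableBitStream s = new VariableBitStream();
        s.write(5, 3);   // 101
        s.write(1, 1);   // 1
        s.write(300, 9); // 100101100
        assert s.next(3) == 5;
        assert s.next(1) == 1;
        assert s.next(9) == 300;
    }
}
```

As the issue notes, decoding one bit at a time like this is much slower than fixed-size bulk decoding; the trade is flexibility for speed.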
[jira] [Created] (LUCENE-4643) PackedInts: convenience classes to write blocks of packed ints
Adrien Grand created LUCENE-4643: Summary: PackedInts: convenience classes to write blocks of packed ints Key: LUCENE-4643 URL: https://issues.apache.org/jira/browse/LUCENE-4643 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor It is often useful to divide a packed stream into fixed blocks which are all compressed independently: * if your sequence of ints is very large, you won't have to buffer everything into memory to compute the required number of bits per value, * the compression ratio will be better in case of rare extreme values. The only drawback compared to the original PackedInts API is that the stream cannot be directly used to deserialize a random-access PackedInts.Reader (but for sequential access, this is just fine). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
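The second bullet (better compression with rare extreme values) can be checked numerically. This sketch is illustrative, not the patch's code: it only compares bit budgets, with made-up class names, a hypothetical block size of 128, and no actual serialization.

```java
// Why fixed, independently compressed blocks help with rare extreme
// values: each block only pays for its own maximum, instead of the whole
// stream paying for the single largest value.
public class BlockedBpv {

    public static int bitsRequired(long v) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(v));
    }

    // total bits if the whole stream uses one bits-per-value
    public static long singleBlockCost(long[] values) {
        int bpv = 1;
        for (long v : values) bpv = Math.max(bpv, bitsRequired(v));
        return (long) bpv * values.length;
    }

    // total bits if split into fixed blocks, each with its own bits-per-value
    public static long blockedCost(long[] values, int blockSize) {
        long total = 0;
        for (int start = 0; start < values.length; start += blockSize) {
            int end = Math.min(start + blockSize, values.length);
            int bpv = 1;
            for (int i = start; i < end; i++) bpv = Math.max(bpv, bitsRequired(values[i]));
            total += (long) bpv * (end - start);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] values = new long[1024];
        java.util.Arrays.fill(values, 3); // small values: 2 bits each
        values[1000] = 1L << 40;          // one rare extreme value: 41 bits
        // single block: every value pays 41 bits; blocked: only one block does
        assert blockedCost(values, 128) < singleBlockCost(values);
    }
}
```

With the numbers above, a single block costs 41 bits for all 1024 values, while blocks of 128 pay 41 bits only in the one block containing the outlier. This also illustrates the first bullet: per-block bpv only requires buffering `blockSize` values, not the whole stream.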
[jira] [Updated] (LUCENE-4643) PackedInts: convenience classes to write blocks of packed ints
[ https://issues.apache.org/jira/browse/LUCENE-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4643: - Attachment: LUCENE-4643.patch Patch. This should be useful for LUCENE-4609 and LUCENE-4599, what do you think? PackedInts: convenience classes to write blocks of packed ints -- Key: LUCENE-4643 URL: https://issues.apache.org/jira/browse/LUCENE-4643 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4643.patch It is often useful to divide a packed stream into fixed blocks which are all compressed independently: * if your sequence of ints is very large, you won't have to buffer everything into memory to compute the required number of bits per value, * the compression ratio will be better in case of rare extreme values. The only drawback compared to the original PackedInts API is that the stream cannot be directly used to deserialize a random-access PackedInts.Reader (but for sequential access, this is just fine). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets
[ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538143#comment-13538143 ] Adrien Grand commented on LUCENE-4609: -- Gilad, I created LUCENE-4643 which I assume should be better than PackedInts.Writer and PackedInts.ReaderIterator for your use-case? It doesn't write heavyweight headers (meaning that you need to know the PackedInts version and the size of the stream otherwise) and encodes data in fixed-size blocks. Write a PackedIntsEncoder/Decoder for facets Key: LUCENE-4609 URL: https://issues.apache.org/jira/browse/LUCENE-4609 Project: Lucene - Core Issue Type: New Feature Components: modules/facet Reporter: Shai Erera Priority: Minor Attachments: LUCENE-4609.patch Today the facets API lets you write IntEncoder/Decoder to encode/decode the category ordinals. We have several such encoders, including VInt (default), and block encoders. It would be interesting to implement and benchmark a PackedIntsEncoder/Decoder, with potentially two variants: (1) receives bitsPerValue up front, when you e.g. know that you have a small taxonomy and the max value you can see and (2) one that decides for each doc on the optimal bitsPerValue, writes it as a header in the byte[] or something. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4643) PackedInts: convenience classes to write blocks of packed ints
[ https://issues.apache.org/jira/browse/LUCENE-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4643: - Attachment: LUCENE-4643.patch Good point. I removed zig-zag encoding and modified the javadocs to say these classes only support positive values. PackedInts: convenience classes to write blocks of packed ints -- Key: LUCENE-4643 URL: https://issues.apache.org/jira/browse/LUCENE-4643 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4643.patch, LUCENE-4643.patch It is often useful to divide a packed stream into fixed blocks which are all compressed independently: * if your sequence of ints is very large, you won't have to buffer everything into memory to compute the required number of bits per value, * the compression ratio will be better in case of rare extreme values. The only drawback compared to the original PackedInts API is that the stream cannot be directly used to deserialize a random-access PackedInts.Reader (but for sequential access, this is just fine). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-4656) Fix EmptyTokenizer
Adrien Grand created LUCENE-4656: Summary: Fix EmptyTokenizer Key: LUCENE-4656 URL: https://issues.apache.org/jira/browse/LUCENE-4656 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial TestRandomChains can fail because EmptyTokenizer doesn't have a CharTermAttribute and doesn't compute the end offset (if the offset attribute was added by a filter). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4656) Fix EmptyTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4656: - Attachment: LUCENE-4656.patch Patch. I wasn't sure whether to add a CharTermAttribute to EmptyTokenizer or to try fixing BaseTokenStreamTestCase but I couldn't think of a non-trivial tokenizer that wouldn't have a CharTermAttribute so I left the assertion that checks that a token stream always has a CharTermAttribute. Fix EmptyTokenizer -- Key: LUCENE-4656 URL: https://issues.apache.org/jira/browse/LUCENE-4656 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Attachments: LUCENE-4656.patch TestRandomChains can fail because EmptyTokenizer doesn't have a CharTermAttribute and doesn't compute the end offset (if the offset attribute was added by a filter). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4656) Fix EmptyTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542899#comment-13542899 ] Adrien Grand commented on LUCENE-4656: -- bq. Why do we have that? It feels strange to me that a non-trivial TokenStream could have no CharTermAttribute? Fix EmptyTokenizer -- Key: LUCENE-4656 URL: https://issues.apache.org/jira/browse/LUCENE-4656 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Attachments: LUCENE-4656.patch TestRandomChains can fail because EmptyTokenizer doesn't have a CharTermAttribute and doesn't compute the end offset (if the offset attribute was added by a filter). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4656) Fix EmptyTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4656: - Attachment: LUCENE-4656.patch Alternative patch that fixes BaseTokenStreamTestCase. I needed to add a quick hack to add a TermToBytesRefAttribute when the tokenstream doesn't have one so that TermsHashPerField doesn't complain that it can't find this attribute when indexing. Fix EmptyTokenizer -- Key: LUCENE-4656 URL: https://issues.apache.org/jira/browse/LUCENE-4656 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Attachments: LUCENE-4656.patch, LUCENE-4656.patch TestRandomChains can fail because EmptyTokenizer doesn't have a CharTermAttribute and doesn't compute the end offset (if the offset attribute was added by a filter). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4656) Fix EmptyTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542995#comment-13542995 ] Adrien Grand commented on LUCENE-4656: -- bq. So we should fix IndexWriter to handle that case? How would IndexWriter handle token streams with no TermToBytesRefAttribute?
- fail if the token stream happens to have tokens? (incrementToken returns true at least once)
- index empty terms?
Fix EmptyTokenizer -- Key: LUCENE-4656 URL: https://issues.apache.org/jira/browse/LUCENE-4656 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Attachments: LUCENE-4656.patch, LUCENE-4656.patch TestRandomChains can fail because EmptyTokenizer doesn't have a CharTermAttribute and doesn't compute the end offset (if the offset attribute was added by a filter). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4656) Fix IndexWriter working together with EmptyTokenizer and EmptyTokenStream (without CharTermAttribute), fix BaseTokenStreamTestCase
[ https://issues.apache.org/jira/browse/LUCENE-4656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543133#comment-13543133 ] Adrien Grand commented on LUCENE-4656: -- Uwe, I just ran all Lucene tests with your patch and they passed, so +1. +1 to removing EmptyTokenizer too. Fix IndexWriter working together with EmptyTokenizer and EmptyTokenStream (without CharTermAttribute), fix BaseTokenStreamTestCase -- Key: LUCENE-4656 URL: https://issues.apache.org/jira/browse/LUCENE-4656 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Affects Versions: 4.0 Reporter: Adrien Grand Assignee: Uwe Schindler Priority: Trivial Fix For: 4.1, 5.0 Attachments: LUCENE-4656_bttc.patch, LUCENE-4656-IW-bug.patch, LUCENE-4656-IW-fix.patch, LUCENE-4656-IW-fix.patch, LUCENE-4656.patch, LUCENE-4656.patch, LUCENE-4656.patch, LUCENE-4656.patch, LUCENE-4656.patch TestRandomChains can fail because EmptyTokenizer doesn't have a CharTermAttribute and doesn't compute the end offset (if the offset attribute was added by a filter). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4643) PackedInts: convenience classes to write blocks of packed ints
[ https://issues.apache.org/jira/browse/LUCENE-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545995#comment-13545995 ] Adrien Grand commented on LUCENE-4643: -- I made some tests with my compressed TermVectorsFormat and the problem is that it sometimes wastes space. For example, if all values in a block are between -1 and 6, the first patch would require 3 bits per value, whereas the 2nd one plus zig-zag encoding a level above would require 4, so I think I should rather commit the first patch? PackedInts: convenience classes to write blocks of packed ints -- Key: LUCENE-4643 URL: https://issues.apache.org/jira/browse/LUCENE-4643 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4643.patch, LUCENE-4643.patch It is often useful to divide a packed stream into fixed blocks which are all compressed independently: * if your sequence of ints is very large, you won't have to buffer everything into memory to compute the required number of bits per value, * the compression ratio will be better in case of rare extreme values. The only drawback compared to the original PackedInts API is that the stream cannot be directly used to deserialize a random-access PackedInts.Reader (but for sequential access, this is just fine). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
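The 3-vs-4-bit comparison in that comment can be verified directly. This sketch only illustrates the arithmetic; the class name is made up and the two strategies are simplified to the bit counts involved.

```java
// Worked check of the [-1, 6] example: storing deltas from a (zig-zag
// encoded) min value needs 3 bits per value, while zig-zag encoding every
// value before packing needs 4.
public class ZigZagVsMin {

    // standard zig-zag: maps ..., -2, -1, 0, 1, 2, ... to 3, 1, 0, 2, 4, ...
    public static long zigZag(long v) { return (v << 1) ^ (v >> 63); }

    public static int bitsRequired(long v) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(v));
    }

    public static void main(String[] args) {
        long min = -1, max = 6;
        // first patch: zig-zag the min value only, store deltas from it
        int deltaBits = bitsRequired(max - min);    // deltas in [0, 7] -> 3 bits
        // second patch: zig-zag each value before feeding the writer
        int zigZagBits = bitsRequired(zigZag(max)); // 6 maps to 12 -> 4 bits
        assert deltaBits == 3;
        assert zigZagBits == 4;
    }
}
```

Zig-zagging each value interleaves negatives among the positives, so the largest encoded value (here 12) can exceed the spread of the original range (here 7), which is exactly the wasted bit described above.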
[jira] [Commented] (LUCENE-4643) PackedInts: convenience classes to write blocks of packed ints
[ https://issues.apache.org/jira/browse/LUCENE-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546017#comment-13546017 ] Adrien Grand commented on LUCENE-4643: -- bq. actually i'm confused why we need it at all, since we are writing only positive numbers (deltas from minValue, which itself is the only one that need be negative). Oh! I think we misunderstood. The first patch uses zig-zag encoding for minValue only and the 2nd patch requires people to zig-zag encode before feeding the writer. PackedInts: convenience classes to write blocks of packed ints -- Key: LUCENE-4643 URL: https://issues.apache.org/jira/browse/LUCENE-4643 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4643.patch, LUCENE-4643.patch It is often useful to divide a packed stream into fixed blocks which are all compressed independently: * if your sequence of ints is very large, you won't have to buffer everything into memory to compute the required number of bits per value, * the compression ratio will be better in case of rare extreme values. The only drawback compared to the original PackedInts API is that the stream cannot be directly used to deserialize a random-access PackedInts.Reader (but for sequential access, this is just fine). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4643) PackedInts: convenience classes to write blocks of packed ints
[ https://issues.apache.org/jira/browse/LUCENE-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546038#comment-13546038 ] Adrien Grand commented on LUCENE-4643: -- All bits are currently used (one to say whether the minValue is 0 or not, and 7 for the number of bitsPerValue (0 <= bpv <= 64; 0 means all values are equal, similarly to the block PF)). But maybe we could:
1. add a constructor argument to say that all values are positive, and it won't zig-zag encode,
2. or disable either the 0 or the 64 bits-per-value case and add a sign bit?
I think the first option is better? PackedInts: convenience classes to write blocks of packed ints -- Key: LUCENE-4643 URL: https://issues.apache.org/jira/browse/LUCENE-4643 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4643.patch, LUCENE-4643.patch It is often useful to divide a packed stream into fixed blocks which are all compressed independently: * if your sequence of ints is very large, you won't have to buffer everything into memory to compute the required number of bits per value, * the compression ratio will be better in case of rare extreme values. The only drawback compared to the original PackedInts API is that the stream cannot be directly used to deserialize a random-access PackedInts.Reader (but for sequential access, this is just fine). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4643) PackedInts: convenience classes to write blocks of packed ints
[ https://issues.apache.org/jira/browse/LUCENE-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546051#comment-13546051 ] Adrien Grand commented on LUCENE-4643: -- bq. just because of the silliness in termvectors Actually, the ability to block-encode negative values can be useful for other use-cases, for example to encode the difference from an expected value (e.g. you can compute an expected offset from the position and the average number of chars per term). Another thing to know is that if all values are positive, minValue is likely to be 0. For example, let's say the actual min is 200 and the max is 2000. Given that encoding the [0-2000] range requires as many bits per value as encoding the [200-2000] range, I set minValue=0. This will require only one bit in the token instead of two bytes (a VInt, since 200 >= 2^7) for the minimum. So in the end, even if one bit is wasted for the minimum value because of zig-zag encoding, this is not too bad. PackedInts: convenience classes to write blocks of packed ints -- Key: LUCENE-4643 URL: https://issues.apache.org/jira/browse/LUCENE-4643 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4643.patch, LUCENE-4643.patch It is often useful to divide a packed stream into fixed blocks which are all compressed independently: * if your sequence of ints is very large, you won't have to buffer everything into memory to compute the required number of bits per value, * the compression ratio will be better in case of rare extreme values. The only drawback compared to the original PackedInts API is that the stream cannot be directly used to deserialize a random-access PackedInts.Reader (but for sequential access, this is just fine). -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
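The minValue=0 reasoning in that comment checks out numerically. This sketch is illustrative only (made-up class name); the VInt byte count assumes the usual 7-payload-bits-per-byte variable-length encoding.

```java
// Check of the min=200, max=2000 example: deltas from 200 need the same
// 11 bits per value as the raw values, so declaring minValue=0 costs
// nothing per value and replaces a 2-byte VInt with a 1-bit flag.
public class MinValueZero {

    public static int bitsRequired(long v) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(v));
    }

    // bytes a non-negative value needs as a VInt (7 payload bits per byte)
    public static int vIntBytes(long v) {
        int bytes = 1;
        while ((v >>>= 7) != 0) bytes++;
        return bytes;
    }

    public static void main(String[] args) {
        // encoding [0-2000] and [200-2000] both need 11 bits per value
        assert bitsRequired(2000) == 11;
        assert bitsRequired(2000 - 200) == 11;
        // storing minValue=200 as a VInt would cost 2 bytes, since 200 >= 2^7
        assert vIntBytes(200) == 2;
    }
}
```

So whenever the subtraction does not shrink the per-value bit count, writing minValue=0 is strictly cheaper: one flag bit in the token instead of a multi-byte VInt.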
[jira] [Created] (LUCENE-4664) oal.codec.compressing: Make Compressor and Decompressor public
Adrien Grand created LUCENE-4664: Summary: oal.codec.compressing: Make Compressor and Decompressor public Key: LUCENE-4664 URL: https://issues.apache.org/jira/browse/LUCENE-4664 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.1, 5.0 Compressor and Decompressor are currently package-private, making it impossible for users to implement their own CompressionMode. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4664) oal.codec.compressing: Make Compressor and Decompressor public
[ https://issues.apache.org/jira/browse/LUCENE-4664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4664: - Attachment: LUCENE-4664.patch Patch. I moved DummyCompressingCodec to oal.codecs.compressing.dummy to make sure all classes are visible enough. oal.codec.compressing: Make Compressor and Decompressor public -- Key: LUCENE-4664 URL: https://issues.apache.org/jira/browse/LUCENE-4664 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.1, 5.0 Attachments: LUCENE-4664.patch Compressor and Decompressor are currently package-private, making it impossible for users to implement their own CompressionMode. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4664) oal.codec.compressing: Make Compressor and Decompressor public
[ https://issues.apache.org/jira/browse/LUCENE-4664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4664. -- Resolution: Fixed oal.codec.compressing: Make Compressor and Decompressor public -- Key: LUCENE-4664 URL: https://issues.apache.org/jira/browse/LUCENE-4664 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.1, 5.0 Attachments: LUCENE-4664.patch Compressor and Decompressor are currently package-private, making it impossible for users to implement their own CompressionMode. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4643) PackedInts: convenience classes to write blocks of packed ints
[ https://issues.apache.org/jira/browse/LUCENE-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4643. -- Resolution: Fixed PackedInts: convenience classes to write blocks of packed ints -- Key: LUCENE-4643 URL: https://issues.apache.org/jira/browse/LUCENE-4643 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4643.patch, LUCENE-4643.patch It is often useful to divide a packed stream into fixed blocks which are all compressed independently: * if your sequence of ints is very large, you won't have to buffer everything into memory to compute the required number of bits per value, * the compression ratio will be better in case of rare extreme values. The only drawback compared to the original PackedInts API is that the stream cannot be directly used to deserialize a random-access PackedInts.Reader (but for sequential access, this is just fine). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-4666) Simplify CompressingStoredFieldsFormat merging
Adrien Grand created LUCENE-4666: Summary: Simplify CompressingStoredFieldsFormat merging Key: LUCENE-4666 URL: https://issues.apache.org/jira/browse/LUCENE-4666 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.1, 5.0 Merging is currently unnecessarily complex: it tries to compute the size of the compressed block by analyzing the compressed stream although it could use the fields index instead. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4666) Simplify CompressingStoredFieldsFormat merging
[ https://issues.apache.org/jira/browse/LUCENE-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4666: - Attachment: LUCENE-4666.patch Patch. Simplify CompressingStoredFieldsFormat merging -- Key: LUCENE-4666 URL: https://issues.apache.org/jira/browse/LUCENE-4666 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.1, 5.0 Attachments: LUCENE-4666.patch Merging is currently unnecessarily complex: it tries to compute the size of the compressed block by analyzing the compressed stream although it could use the fields index instead.
[jira] [Created] (LUCENE-4667) Change TestRandomChains to replace the list of broken classes by a list of broken constructors
Adrien Grand created LUCENE-4667: Summary: Change TestRandomChains to replace the list of broken classes by a list of broken constructors Key: LUCENE-4667 URL: https://issues.apache.org/jira/browse/LUCENE-4667 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Priority: Minor Some classes are currently in the list of bad apples although only one constructor is broken. For example, LimitTokenCountFilter has an option to consume the whole stream.
[jira] [Resolved] (LUCENE-4666) Simplify CompressingStoredFieldsFormat merging
[ https://issues.apache.org/jira/browse/LUCENE-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4666. -- Resolution: Fixed Simplify CompressingStoredFieldsFormat merging -- Key: LUCENE-4666 URL: https://issues.apache.org/jira/browse/LUCENE-4666 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.1, 5.0 Attachments: LUCENE-4666.patch Merging is currently unnecessarily complex: it tries to compute the size of the compressed block by analyzing the compressed stream although it could use the fields index instead.
[jira] [Updated] (LUCENE-4667) Change TestRandomChains to replace the list of broken classes by a list of broken constructors
[ https://issues.apache.org/jira/browse/LUCENE-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4667: - Attachment: LUCENE-4667.patch Patch. Change TestRandomChains to replace the list of broken classes by a list of broken constructors -- Key: LUCENE-4667 URL: https://issues.apache.org/jira/browse/LUCENE-4667 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Priority: Minor Attachments: LUCENE-4667.patch Some classes are currently in the list of bad apples although only one constructor is broken. For example, LimitTokenCountFilter has an option to consume the whole stream.
[jira] [Commented] (LUCENE-4667) Change TestRandomChains to replace the list of broken classes by a list of broken constructors
[ https://issues.apache.org/jira/browse/LUCENE-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13548505#comment-13548505 ] Adrien Grand commented on LUCENE-4667: -- The test failed when I used an IdentityHashMap. Did I miss something, or can't constructors be compared using ==? Change TestRandomChains to replace the list of broken classes by a list of broken constructors -- Key: LUCENE-4667 URL: https://issues.apache.org/jira/browse/LUCENE-4667 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Priority: Minor Attachments: LUCENE-4667.patch Some classes are currently in the list of bad apples although only one constructor is broken. For example, LimitTokenCountFilter has an option to consume the whole stream.
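The failure described above is reproducible with plain JDK reflection: {{Class.getConstructor}} returns a fresh {{Constructor}} copy on every call, so two lookups are never identical even though they compare equal — which is exactly why an IdentityHashMap keyed on constructors misses its entries. A minimal demonstration:

```java
import java.lang.reflect.Constructor;

// Shows that reflection hands out a fresh Constructor copy per lookup:
// == fails between two lookups of the same constructor, equals() succeeds.
public class ConstructorIdentity {

    public static boolean sameIdentity() throws Exception {
        Constructor<String> c1 = String.class.getConstructor();
        Constructor<String> c2 = String.class.getConstructor();
        return c1 == c2; // false: each getConstructor() call copies the root Constructor
    }

    public static boolean sameEquals() throws Exception {
        Constructor<String> c1 = String.class.getConstructor();
        Constructor<String> c2 = String.class.getConstructor();
        return c1.equals(c2); // true: equals() compares declaring class + signature
    }

    public static void main(String[] args) throws Exception {
        System.out.println("== : " + sameIdentity());
        System.out.println("equals: " + sameEquals());
    }
}
```

So a regular HashMap (equals-based) works for constructor keys where an IdentityHashMap does not.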
[jira] [Updated] (LUCENE-4667) Change TestRandomChains to replace the list of broken classes by a list of broken constructors
[ https://issues.apache.org/jira/browse/LUCENE-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4667: - Attachment: LUCENE-4667.patch New patch that adds exceptions to TrimFilter and TypeTokenFilter as well and uses a constructor map for all components, following Uwe's advice. Change TestRandomChains to replace the list of broken classes by a list of broken constructors -- Key: LUCENE-4667 URL: https://issues.apache.org/jira/browse/LUCENE-4667 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Priority: Minor Attachments: LUCENE-4667.patch, LUCENE-4667.patch Some classes are currently in the list of bad apples although only one constructor is broken. For example, LimitTokenCountFilter has an option to consume the whole stream.
[jira] [Assigned] (LUCENE-4667) Change TestRandomChains to replace the list of broken classes by a list of broken constructors
[ https://issues.apache.org/jira/browse/LUCENE-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand reassigned LUCENE-4667: Assignee: Adrien Grand Change TestRandomChains to replace the list of broken classes by a list of broken constructors -- Key: LUCENE-4667 URL: https://issues.apache.org/jira/browse/LUCENE-4667 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4667.patch, LUCENE-4667.patch Some classes are currently in the list of bad apples although only one constructor is broken. For example, LimitTokenCountFilter has an option to consume the whole stream.
[jira] [Updated] (LUCENE-4667) Change TestRandomChains to replace the list of broken classes by a list of broken constructors
[ https://issues.apache.org/jira/browse/LUCENE-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4667: - Attachment: LUCENE-4667.patch bq. Maybe that's the case! Sorry. I was expecting that constructors are singletons like classes. No problem, I had the same expectation and was a little disappointed to see that it didn't work! bq. I think maybe the whole Predicate approach is too detailed? I think it's worth excluding with a predicate: for example this allows testing random chains with LimitTokenCountFilter(consumeAllTokens=true) (when consumeAllTokens=false, this filter is broken). bq. I would exclude all broken constructors with the ALWAYS predicate in beforeClass() Sounds good, I updated the patch. Change TestRandomChains to replace the list of broken classes by a list of broken constructors -- Key: LUCENE-4667 URL: https://issues.apache.org/jira/browse/LUCENE-4667 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4667.patch, LUCENE-4667.patch, LUCENE-4667.patch Some classes are currently in the list of bad apples although only one constructor is broken. For example, LimitTokenCountFilter has an option to consume the whole stream.
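The predicate approach discussed above can be sketched as a map from each broken Constructor to a predicate over its arguments, so a constructor is only excluded for the argument combinations that actually break it. This is a hypothetical sketch, not the TestRandomChains code; {{StringBuilder(int)}} stands in for a real analysis-component constructor, and the names {{ArgsPredicate}}, {{ALWAYS}}, and {{isBroken}} are made up for illustration:

```java
import java.lang.reflect.Constructor;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: exclude broken constructors per argument combination.
// A plain HashMap works as the key container because Constructor.equals()
// compares declaring class and signature (identity comparison would not work).
public class BrokenConstructors {

    interface ArgsPredicate { boolean test(Object[] args); }

    // Predicate for constructors that are broken for every argument combination.
    static final ArgsPredicate ALWAYS = args -> true;

    static final Map<Constructor<?>, ArgsPredicate> BROKEN = new HashMap<>();
    static {
        try {
            // Assumed example: StringBuilder(int) standing in for a token
            // filter constructor that is only broken for negative arguments.
            BROKEN.put(StringBuilder.class.getConstructor(int.class),
                       args -> ((Integer) args[0]) < 0);
        } catch (NoSuchMethodException e) {
            throw new AssertionError(e);
        }
    }

    static boolean isBroken(Constructor<?> ctor, Object... args) {
        ArgsPredicate p = BROKEN.get(ctor); // lookup succeeds via equals(), not ==
        return p != null && p.test(args);
    }

    public static void main(String[] args) throws Exception {
        Constructor<?> ctor = StringBuilder.class.getConstructor(int.class);
        System.out.println(isBroken(ctor, -1)); // excluded: breaks with this argument
        System.out.println(isBroken(ctor, 16)); // allowed: fine with this argument
    }
}
```

This is what makes the predicate worth the extra machinery: the same constructor can stay in the random pool for the argument values that work.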
[jira] [Commented] (LUCENE-4669) Document wrongly deleted from index
[ https://issues.apache.org/jira/browse/LUCENE-4669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13548622#comment-13548622 ] Adrien Grand commented on LUCENE-4669: -- Hi Miguel, c has not been deleted, the problem is that you used IndexReader.numDocs instead of IndexReader.maxDoc. Given that you deleted a document, IndexReader.numDocs decreased from 3 to 2 but c still has docId==2 so your print(File) method doesn't display it. Document wrongly deleted from index --- Key: LUCENE-4669 URL: https://issues.apache.org/jira/browse/LUCENE-4669 Project: Lucene - Core Issue Type: Bug Components: core/index Affects Versions: 4.0 Environment: OS = Mac OS X 10.7.5 Java = JVM 1.6 Reporter: Miguel Ferreira I'm trying to implement document deletion from an index. If I create an index with three documents (A, B and C) and then try to delete A, A gets marked as deleted but C is removed from the index. I've tried this with different numbers of documents and saw that it is always the last document that is removed. 
When I run the example unit test code below I get this output:
{code}
Before delete
Found 3 documents
Document at = 0; isDeleted = false; path = a;
Document at = 1; isDeleted = false; path = b;
Document at = 2; isDeleted = false; path = c;
After delete
Found 2 documents
Document at = 0; isDeleted = true; path = a;
Document at = 1; isDeleted = false; path = b;
{code}
Example unit test:
{code:title=ExampleUnitTest.java}
@Test
public void delete() throws Exception {
    File indexDir = FileUtils.createTempDir();
    IndexWriter writer = new IndexWriter(new NIOFSDirectory(indexDir),
        new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40)));
    Document doc = new Document();
    String fieldName = "path";
    doc.add(new StringField(fieldName, "a", Store.YES));
    writer.addDocument(doc);
    doc = new Document();
    doc.add(new StringField(fieldName, "b", Store.YES));
    writer.addDocument(doc);
    doc = new Document();
    doc.add(new StringField(fieldName, "c", Store.YES));
    writer.addDocument(doc);
    writer.commit();
    System.out.println("Before delete");
    print(indexDir);
    writer.deleteDocuments(new Term(fieldName, "a"));
    writer.commit();
    System.out.println("After delete");
    print(indexDir);
}

public static void print(File indexDirectory) throws IOException {
    DirectoryReader reader = DirectoryReader.open(new NIOFSDirectory(indexDirectory));
    Bits liveDocs = MultiFields.getLiveDocs(reader);
    int numDocs = reader.numDocs();
    System.out.println("Found " + numDocs + " documents");
    for (int i = 0; i < numDocs; i++) {
        Document document = reader.document(i);
        StringBuffer sb = new StringBuffer();
        sb.append("Document at = ").append(i);
        sb.append("; isDeleted = ").append(liveDocs != null ? !liveDocs.get(i) : false).append("; ");
        for (IndexableField field : document.getFields()) {
            String fieldName = field.name();
            for (String value : document.getValues(fieldName)) {
                sb.append(fieldName).append(" = ").append(value).append("; ");
            }
        }
        System.out.println(sb.toString());
    }
}
{code}
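The numDocs/maxDoc distinction behind this report can be modeled in plain Java (a toy model, not the Lucene API; the class and method names are made up for illustration): maxDoc counts every doc id ever assigned, numDocs only the live ones, so looping up to numDocs after a delete silently hides the highest doc ids.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not the Lucene API) of live docs after a deletion:
// doc ids are dense [0, maxDoc); deleted docs keep their id but go dead.
public class LiveDocsModel {
    final boolean[] liveDocs; // liveDocs[i] == true -> doc i is live

    LiveDocsModel(boolean... liveDocs) { this.liveDocs = liveDocs; }

    int maxDoc() { return liveDocs.length; }       // ids ever assigned

    int numDocs() {                                // live docs only
        int n = 0;
        for (boolean live : liveDocs) if (live) n++;
        return n;
    }

    // Correct enumeration: scan the full id space and skip deleted docs.
    List<Integer> liveDocIds() {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < maxDoc(); i++) if (liveDocs[i]) ids.add(i);
        return ids;
    }

    public static void main(String[] args) {
        // Docs a(0), b(1), c(2); a was deleted.
        LiveDocsModel index = new LiveDocsModel(false, true, true);
        System.out.println("numDocs = " + index.numDocs());     // 2
        System.out.println("live ids = " + index.liveDocIds()); // [1, 2]
        // A loop "for (i = 0; i < numDocs(); i++)" visits ids 0 and 1 only,
        // hiding doc 2 ("c") exactly as in the report above.
    }
}
```

The fix in the reported test is therefore to iterate up to maxDoc and consult the live-docs bits, rather than iterating up to numDocs.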
[jira] [Resolved] (LUCENE-4667) Change TestRandomChains to replace the list of broken classes by a list of broken constructors
[ https://issues.apache.org/jira/browse/LUCENE-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4667. -- Resolution: Fixed Change TestRandomChains to replace the list of broken classes by a list of broken constructors -- Key: LUCENE-4667 URL: https://issues.apache.org/jira/browse/LUCENE-4667 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-4667.patch, LUCENE-4667.patch, LUCENE-4667.patch Some classes are currently in the list of bad apples although only one constructor is broken. For example, LimitTokenCountFilter has an option to consume the whole stream.
[jira] [Created] (LUCENE-4670) Add TermVectorsWriter.finish{Doc,Field,Term} to make development of new formats easier
Adrien Grand created LUCENE-4670: Summary: Add TermVectorsWriter.finish{Doc,Field,Term} to make development of new formats easier Key: LUCENE-4670 URL: https://issues.apache.org/jira/browse/LUCENE-4670 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.1 This is especially useful to LUCENE-4599 where actions have to be taken after a doc/field/term has been added.
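The shape of such finish hooks can be sketched in plain Java. This is a hypothetical sketch, not the actual TermVectorsWriter API; all class and method names below are invented for illustration. The idea is that the indexing chain calls start* before and finish* after each unit, so a format can act once a doc/field/term is complete without buffering everything itself:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of start*/finish* callbacks in a push-style writer.
public class FinishHooksSketch {

    static abstract class Writer {
        abstract void startDocument();
        void finishDocument() {} // optional hooks default to no-op
        abstract void startField(String name);
        void finishField() {}
        abstract void addTerm(String term);
        void finishTerm() {}
    }

    // A toy format that records the order in which each hook fires.
    static class RecordingWriter extends Writer {
        final List<String> events = new ArrayList<>();
        void startDocument() { events.add("startDoc"); }
        void finishDocument() { events.add("finishDoc"); }
        void startField(String name) { events.add("startField:" + name); }
        void finishField() { events.add("finishField"); }
        void addTerm(String term) { events.add("term:" + term); }
        void finishTerm() { events.add("finishTerm"); }
    }

    // The "indexing chain": drives the writer and fires finish* callbacks.
    static void writeDoc(Writer w, String field, String... terms) {
        w.startDocument();
        w.startField(field);
        for (String t : terms) {
            w.addTerm(t);
            w.finishTerm();
        }
        w.finishField();
        w.finishDocument();
    }

    public static void main(String[] args) {
        RecordingWriter w = new RecordingWriter();
        writeDoc(w, "body", "hello", "world");
        System.out.println(w.events);
    }
}
```

Because the hooks default to no-ops, a format only overrides the ones where it needs to flush per-unit state.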