[jira] [Commented] (LUCENE-8939) Shared Hit Count Early Termination

2019-09-13 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929314#comment-16929314
 ] 

Michael McCandless commented on LUCENE-8939:


Thanks [~jpountz], this is an exciting improvement for concurrent search users!

> Shared Hit Count Early Termination
> --
>
> Key: LUCENE-8939
> URL: https://issues.apache.org/jira/browse/LUCENE-8939
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Atri Sharma
>Priority: Major
> Fix For: 8.3
>
>  Time Spent: 12h 20m
>  Remaining Estimate: 0h
>
> When collecting hits across sorted segments, it should be possible to 
> terminate early across all slices when enough hits have been collected 
> globally i.e. hit count > numHits AND hit count < totalHitsThreshold
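The shared-counter idea can be sketched in plain Java (class and method names here are illustrative stand-ins, not Lucene's API): every slice increments one global counter, and a slice stops collecting once it has filled its local top-N and the global count has crossed the threshold.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of shared hit counting across slices: all slices share
// one LongAdder; a slice terminates early once it has numHits hits locally
// and the global count has reached totalHitsThreshold.
public class SharedHitCount {
    final LongAdder globalHits = new LongAdder(); // shared across all slices
    final int numHits;
    final int totalHitsThreshold;

    public SharedHitCount(int numHits, int totalHitsThreshold) {
        this.numHits = numHits;
        this.totalHitsThreshold = totalHitsThreshold;
    }

    public long globalCount() {
        return globalHits.sum();
    }

    /** Returns how many docs this slice actually visited before terminating. */
    public int collectSlice(int sliceDocCount) {
        int localHits = 0;
        int visited = 0;
        for (int doc = 0; doc < sliceDocCount; doc++) {
            visited++;
            localHits++; // pretend every doc matches, for illustration
            globalHits.increment();
            // early termination: enough hits locally, and globally past threshold
            if (localHits >= numHits && globalHits.sum() >= totalHitsThreshold) {
                break;
            }
        }
        return visited;
    }
}
```

With several threads each running `collectSlice`, the global counter lets all of them stop early, instead of each slice having to reach the threshold on its own.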



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8939) Shared Hit Count Early Termination

2019-09-11 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927570#comment-16927570
 ] 

Michael McCandless commented on LUCENE-8939:


Is this issue done?  Will we backport to 8.x?

> Shared Hit Count Early Termination
> --
>
> Key: LUCENE-8939
> URL: https://issues.apache.org/jira/browse/LUCENE-8939
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Atri Sharma
>Priority: Major
>  Time Spent: 12h 20m
>  Remaining Estimate: 0h
>
> When collecting hits across sorted segments, it should be possible to 
> terminate early across all slices when enough hits have been collected 
> globally i.e. hit count > numHits AND hit count < totalHitsThreshold






[jira] [Commented] (LUCENE-7282) search APIs should take advantage of index sort by default

2019-09-10 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926850#comment-16926850
 ] 

Michael McCandless commented on LUCENE-7282:


Aha, thanks [~atris]!

> search APIs should take advantage of index sort by default
> --
>
> Key: LUCENE-7282
> URL: https://issues.apache.org/jira/browse/LUCENE-7282
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> Spinoff from LUCENE-6766, where we made it very easy to have Lucene sort 
> documents in the index (at merge time).
> An index-time sort is powerful because if you then search that index by the 
> same sort (or by a "prefix" of it), you can early-terminate per segment once 
> you've collected enough hits.  But doing this by default would mean accepting 
> an approximate hit count, and could not be used in cases that need to see 
> every hit, e.g. if you are also faceting.
> Separately, `TermQuery` on the leading sort field can be very fast since we 
> can advance to the first docID, and only match to the last docID for the 
> requested value.  This would not be approximate, and should be lower risk / 
> easier.






[jira] [Commented] (LUCENE-7282) search APIs should take advantage of index sort by default

2019-09-10 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926839#comment-16926839
 ] 

Michael McCandless commented on LUCENE-7282:


Do we optimize the case where an exact or range DV query clause is "congruent" 
with the index sort?  E.g. say my index sort is a {{DocValues.NUMERIC}} field 
{{foobar}}, and my query has a clause {{foobar=17}}; then we can efficiently 
skip, per segment, to the {{docid}} range for the value {{17}}, even if the user 
did not index dimensional points for that field. I thought we had an issue open 
for this but I can't find it now ...
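A hedged sketch of that skip, with a plain sorted array standing in for the segment's per-document values (names made up for illustration): because the segment is sorted by the field, the docIDs matching one value form a single contiguous range, recoverable with two binary searches.

```java
// Illustrative sketch: in a segment sorted by a numeric field, find the
// inclusive [firstDocID, lastDocID] range matching `value` with two binary
// searches over the sorted values, with no points or terms index needed.
public class SortedValueRange {
    /** Returns {firstDocID, lastDocID} inclusive for `value`, or null if absent. */
    public static int[] range(long[] sortedValues, long value) {
        int lo = lowerBound(sortedValues, value);
        if (lo == sortedValues.length || sortedValues[lo] != value) {
            return null; // value not present in this segment
        }
        int hi = lowerBound(sortedValues, value + 1) - 1;
        return new int[] {lo, hi};
    }

    // first index whose value is >= target
    private static int lowerBound(long[] a, long target) {
        int lo = 0, hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}
```

A {{TermQuery}}-like clause on the leading sort field then only has to score docs inside that range, which is exact rather than approximate.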

> search APIs should take advantage of index sort by default
> --
>
> Key: LUCENE-7282
> URL: https://issues.apache.org/jira/browse/LUCENE-7282
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> Spinoff from LUCENE-6766, where we made it very easy to have Lucene sort 
> documents in the index (at merge time).
> An index-time sort is powerful because if you then search that index by the 
> same sort (or by a "prefix" of it), you can early-terminate per segment once 
> you've collected enough hits.  But doing this by default would mean accepting 
> an approximate hit count, and could not be used in cases that need to see 
> every hit, e.g. if you are also faceting.
> Separately, `TermQuery` on the leading sort field can be very fast since we 
> can advance to the first docID, and only match to the last docID for the 
> requested value.  This would not be approximate, and should be lower risk / 
> easier.






[jira] [Commented] (LUCENE-8963) Allow Collectors To "Publish" If They Can Be Used In Concurrent Search

2019-09-05 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923284#comment-16923284
 ] 

Michael McCandless commented on LUCENE-8963:


Do we have examples of collectors in Lucene today that are single-threaded?  
The core collectors, at least {{TopFieldCollector}} and {{TopDocsCollector}}, 
seem to be OK since {{IndexSearcher}} makes a {{CollectorManager}} that uses 
{{TopDocs.merge}} in the end.

So maybe, as long as a {{CollectorManager}} is available, that implies it is 
thread safe?
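A rough sketch of the opt-out flag the issue floats (entirely hypothetical names, not a committed Lucene API): a default method returning true, which a manager would consult before parallelizing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed API: a collector advertises whether
// per-slice instances of it may run concurrently; the default is true, as the
// issue suggests, and single-threaded collectors explicitly opt out.
interface SliceCollector {
    void collect(int docID);

    /** Override to return false if this collector's semantics are single-threaded. */
    default boolean supportsConcurrentSearch() {
        return true;
    }
}

public class CollectorCheck {
    /** A manager would refuse to parallelize if any collector opts out. */
    public static boolean canSearchConcurrently(List<SliceCollector> perSlice) {
        return perSlice.stream().allMatch(SliceCollector::supportsConcurrentSearch);
    }

    public static void main(String[] args) {
        SliceCollector ok = doc -> {};                     // default: concurrent-friendly
        SliceCollector serialOnly = new SliceCollector() { // explicit opt-out
            public void collect(int docID) {}
            public boolean supportsConcurrentSearch() { return false; }
        };
        List<SliceCollector> slices = new ArrayList<>();
        slices.add(ok);
        System.out.println(canSearchConcurrently(slices)); // true
        slices.add(serialOnly);
        System.out.println(canSearchConcurrently(slices)); // false
    }
}
```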

> Allow Collectors To "Publish" If They Can Be Used In Concurrent Search
> --
>
> Key: LUCENE-8963
> URL: https://issues.apache.org/jira/browse/LUCENE-8963
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> There is an implied assumption today that all we need to run a query 
> concurrently is a CollectorManager implementation. While that is true, there 
> might be some corner cases where a Collector's semantics do not allow it to 
> be concurrently executed (think of ES's aggregates). If a user manages to 
> write a CollectorManager with a Collector that is not really concurrent 
> friendly, we could end up in an undefined state.
>  
> This Jira is more of a rhetorical discussion, and to explore if we should 
> allow Collectors to implement an API which simply returns a boolean 
> signifying if a Collector is parallel ready or not. The default would be 
> true, until a Collector explicitly overrides it?






[jira] [Updated] (LUCENE-8884) Add Directory wrapper to track per-query IO counters

2019-09-04 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-8884:
---
Attachment: LUCENE-8884.patch
Status: Open  (was: Open)

Another iteration, folding in [~rcmuir]'s feedback.

I was worried that the thread that called {{clone()}} may not be the same 
thread that then consumes the {{IndexInput}} and added an assertion, but it 
looks like it's OK.  I added another random test too, and improved javadocs.

I could not eliminate the required {{setKeyForThread}} call because even in the 
single threaded case, where only one thread executes the query across all 
segments, the directory wrapper still needs to know which query that is to 
track its IO counters.

I haven't tested the performance impact of this, but it's likely minor now that 
we retrieve the counters on {{clone()}} instead of on every IO operation.
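The clone-time capture described above can be sketched in plain Java (all class and method names here are illustrative stand-ins for the patch's classes): each thread registers the query it is serving, a reader "clone" resolves that query's counters once, and then each read only bumps two adders.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of per-query IO tracking: the expensive lookup
// (thread -> query -> counters) happens once at clone/open time, so per-read
// overhead is just two LongAdder increments.
public class IOCounters {
    static final class Counters {
        final LongAdder iops = new LongAdder();
        final LongAdder bytes = new LongAdder();
    }

    private final Map<String, Counters> byQuery = new ConcurrentHashMap<>();
    private final ThreadLocal<String> queryForThread = new ThreadLocal<>();

    /** Analogue of the required per-thread key call, before the thread does IO. */
    public void setQueryForThread(String queryId) { queryForThread.set(queryId); }

    public Counters countersFor(String queryId) {
        return byQuery.computeIfAbsent(queryId, q -> new Counters());
    }

    /** Analogue of IndexInput.clone(): resolve the counters once, not per read. */
    public TrackingReader openReader() {
        return new TrackingReader(countersFor(queryForThread.get()));
    }

    public static final class TrackingReader {
        private final Counters counters; // captured at clone/open time
        TrackingReader(Counters counters) { this.counters = counters; }
        public void read(int numBytes) {
            counters.iops.increment();
            counters.bytes.add(numBytes);
        }
    }
}
```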

> Add Directory wrapper to track per-query IO counters
> 
>
> Key: LUCENE-8884
> URL: https://issues.apache.org/jira/browse/LUCENE-8884
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/store
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-8884.patch, LUCENE-8884.patch
>
>
> Lucene's IO abstractions ({{Directory, IndexInput/Output}}) make it really 
> easy to track counters of how many IOPs and net bytes are read for each 
> query, which is a useful metric to track/aggregate/alarm on in production or 
> dev benchmarks.
> At my day job we use these wrappers in our nightly benchmarks to catch any 
> accidental performance regressions.






[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2019-09-04 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922559#comment-16922559
 ] 

Michael McCandless commented on LUCENE-8962:


Thanks [~dsmiley].

That sounds like a nice improvement to TMP, but I would want to tie the more 
aggressive merging of those tiny segments to a refresh or commit, not run it in 
general, since I think for pure indexing that'd hurt indexing throughput.  I 
think the tricky part of this change is fixing {{IndexWriter}}'s refresh or 
commit to let the merge policy know it should now aggressively merge small 
segments, within a time or total size budget or something, while the 
refresh/commit operation waits, so that the returned segments have been merged, 
even while (concurrently) new segments are flushed.

Synchronous merges in merge scheduler sound interesting for this use case – 
maybe open a separate issue for that?
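A toy sketch of the budgeted small-segment merge under discussion (all names and logic are hypothetical, with longs standing in for segments): coalesce the sub-threshold segments from the refresh snapshot within a time budget, and since later flushes are not in the snapshot list, they stay out of the point-in-time view automatically.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: greedily fold segments below a size floor into one
// merged segment, stopping when the time budget expires; segments flushed
// after the snapshot was taken are simply not in the input list.
public class RefreshMerger {
    /** Returns the segment sizes the refreshed reader would see. */
    public static List<Long> mergeSmallSegments(List<Long> snapshotSizes,
                                                long sizeFloor,
                                                long budgetMillis) {
        long deadline = System.currentTimeMillis() + budgetMillis;
        List<Long> result = new ArrayList<>();
        long merged = 0;      // accumulated size of the small segments
        int mergedCount = 0;
        for (long size : snapshotSizes) {
            if (size < sizeFloor && System.currentTimeMillis() < deadline) {
                merged += size;   // pretend-merge: fold into one segment
                mergedCount++;
            } else {
                result.add(size); // big enough (or out of budget): keep as-is
            }
        }
        if (mergedCount > 0) {
            result.add(merged);
        }
        return result;
    }
}
```

The hard part the comment points at is not this loop but the {{IndexWriter}} plumbing around it: blocking the refresh until these merges finish while indexing continues concurrently.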

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate many small segments during {{refresh}}, and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!






[jira] [Created] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2019-09-03 Thread Michael McCandless (Jira)
Michael McCandless created LUCENE-8962:
--

 Summary: Can we merge small segments during refresh, for faster 
searching?
 Key: LUCENE-8962
 URL: https://issues.apache.org/jira/browse/LUCENE-8962
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: Michael McCandless


With near-real-time search we ask {{IndexWriter}} to write all in-memory 
segments to disk and open an {{IndexReader}} to search them, and this is 
typically a quick operation.

However, when you use many threads for concurrent indexing, {{IndexWriter}} 
will accumulate many small segments during {{refresh}}, and this then adds 
search-time cost as searching must visit all of these tiny segments.

The merge policy would normally quickly coalesce these small segments if given 
a little time ... so, could we somehow improve {{IndexWriter}}'s refresh to 
optionally kick off merge policy to merge segments below some threshold before 
opening the near-real-time reader?  It'd be a bit tricky because while we are 
waiting for merges, indexing may continue, and new segments may be flushed, but 
those new segments shouldn't be included in the point-in-time segments returned 
by refresh ...

One could almost do this on top of Lucene today, with a custom merge policy, 
and some hackity logic to have the merge policy target small segments just 
written by refresh, but it's tricky to then open a near-real-time reader, 
excluding newly flushed but including newly merged segments since the refresh 
originally finished ...

I'm not yet sure how best to solve this, so I wanted to open an issue for 
discussion!






[jira] [Commented] (LUCENE-8403) Support 'filtered' term vectors - don't require all terms to be present

2019-08-31 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920099#comment-16920099
 ] 

Michael McCandless commented on LUCENE-8403:


Maybe first test the performance of the separate field?  If the double-analysis 
is really a problem, you could use {{CachingTokenFilter}} to analyze only once?

> Support 'filtered' term vectors - don't require all terms to be present
> ---
>
> Key: LUCENE-8403
> URL: https://issues.apache.org/jira/browse/LUCENE-8403
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Braun
>Priority: Minor
> Attachments: LUCENE-8403.patch
>
>
> The genesis of this was a conversation and idea from [~dsmiley] several years 
> ago.
> In order to optimize term vector storage, we may not actually need all tokens 
> to be present in the term vectors - and if so, ideally our codec could just 
> opt not to store them.
> I attempted to fork the standard codec and override the TermVectorsFormat and 
> TermVectorsWriter to ignore storing certain Terms within a field. This 
> worked, however, CheckIndex checks that the terms present in the standard 
> postings are also present in the TVs, if TVs enabled. So this then doesn't 
> work as 'valid' according to CheckIndex.
> Can the TermVectorsFormat be made in such a way to support configuration of 
> tokens that should not be stored (benefits: less storage, more optimal 
> retrieval per doc)? Is this valuable to the wider community? Is there a way 
> we can design this to not break CheckIndex's contract while at the same time 
> lessening storage for unneeded tokens?






[jira] [Commented] (LUCENE-8884) Add Directory wrapper to track per-query IO counters

2019-08-16 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909302#comment-16909302
 ] 

Michael McCandless commented on LUCENE-8884:


Thanks for the review [~rcmuir].

We need the thread locals because we pass {{ExecutorService}} to 
{{IndexSearcher}} to keep our long-pole query latencies down.  So we need some 
way to associate each searcher thread with the query it's handling, but maybe we 
can make that less invasive, e.g. with a better default for the more common 
single-threaded query case?
{quote}must you call get on every op vs once in the ctor? after all thats why 
we have clone? ( should not have thread issues )
{quote}
Ahh that's a good point – once the {{IndexInput}} is created, only one thread 
will use it – I'll fix that!  This should reduce overhead substantially, maybe 
enough to run in production by default.
{quote}readint has a second spurious call.
{quote}
Woops, I'll fix that too.

> Add Directory wrapper to track per-query IO counters
> 
>
> Key: LUCENE-8884
> URL: https://issues.apache.org/jira/browse/LUCENE-8884
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/store
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-8884.patch
>
>
> Lucene's IO abstractions ({{Directory, IndexInput/Output}}) make it really 
> easy to track counters of how many IOPs and net bytes are read for each 
> query, which is a useful metric to track/aggregate/alarm on in production or 
> dev benchmarks.
> At my day job we use these wrappers in our nightly benchmarks to catch any 
> accidental performance regressions.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8884) Add Directory wrapper to track per-query IO counters

2019-08-14 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907361#comment-16907361
 ] 

Michael McCandless commented on LUCENE-8884:


I plan to push this soon ... it just adds this new directory wrapper to the 
{{misc}} module.

> Add Directory wrapper to track per-query IO counters
> 
>
> Key: LUCENE-8884
> URL: https://issues.apache.org/jira/browse/LUCENE-8884
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/store
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-8884.patch
>
>
> Lucene's IO abstractions ({{Directory, IndexInput/Output}}) make it really 
> easy to track counters of how many IOPs and net bytes are read for each 
> query, which is a useful metric to track/aggregate/alarm on in production or 
> dev benchmarks.
> At my day job we use these wrappers in our nightly benchmarks to catch any 
> accidental performance regressions.






[jira] [Commented] (LUCENE-8947) Indexing fails with "too many tokens for field" when using custom term frequencies

2019-08-14 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907330#comment-16907330
 ] 

Michael McCandless commented on LUCENE-8947:


Indeed we disable norms ... that’s a good idea to skip length accumulation when 
norms are disabled.  I’ll give that a shot.

> Indexing fails with "too many tokens for field" when using custom term 
> frequencies
> --
>
> Key: LUCENE-8947
> URL: https://issues.apache.org/jira/browse/LUCENE-8947
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 7.5
>Reporter: Michael McCandless
>Priority: Major
>
> We are using custom term frequencies (LUCENE-7854) to index per-token scoring 
> signals, however for one document that had many tokens and those tokens had 
> fairly large (~998,000) scoring signals, we hit this exception:
> {noformat}
> 2019-08-05T21:32:37,048 [ERROR] (LuceneIndexing-3-thread-3) 
> com.amazon.lucene.index.IndexGCRDocument: Failed to index doc: 
> java.lang.IllegalArgumentException: too many tokens for field "foobar"
> at 
> org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:825)
> at 
> org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
> at 
> org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
> at 
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)
> at 
> org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)
> at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)
> at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)
> {noformat}
> This is happening in this code in {{DefaultIndexingChain.java}}:
> {noformat}
>   try {
> invertState.length = Math.addExact(invertState.length, 
> invertState.termFreqAttribute.getTermFrequency());
>   } catch (ArithmeticException ae) {
> throw new IllegalArgumentException("too many tokens for field \"" + 
> field.name() + "\"");
>   }{noformat}
> Where Lucene is accumulating the total length (number of tokens) for the 
> field.  But total length doesn't really make sense if you are using custom 
> term frequencies to hold arbitrary scoring signals?  Or, maybe it does make 
> sense, if the user is using this for simple boosting, but maybe we should allow 
> this length to be a {{long}}?
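A small self-contained demonstration of the failure mode and the floated {{long}} fix (the class and method names are made up; only the {{Math.addExact}} pattern mirrors the snippet quoted above): an {{int}} accumulator overflows after a couple thousand ~998k signals, while a {{long}} has ample headroom.

```java
// Demonstration: accumulating large custom term frequencies into an int
// overflows quickly; accumulating into a long (the proposed change) does not.
public class FieldLengthOverflow {
    /** Mimics the int accumulation in DefaultIndexingChain: throws on overflow. */
    public static int accumulateInt(int length, int[] termFreqs) {
        for (int freq : termFreqs) {
            try {
                length = Math.addExact(length, freq);
            } catch (ArithmeticException ae) {
                throw new IllegalArgumentException(
                    "too many tokens for field \"foobar\"");
            }
        }
        return length;
    }

    /** The proposed alternative: accumulate the field length into a long. */
    public static long accumulateLong(long length, int[] termFreqs) {
        for (int freq : termFreqs) {
            length = Math.addExact(length, (long) freq);
        }
        return length;
    }
}
```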






[jira] [Commented] (LUCENE-8369) Remove the spatial module as it is obsolete

2019-08-12 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905566#comment-16905566
 ] 

Michael McCandless commented on LUCENE-8369:


+1 for option 1 above.

> Remove the spatial module as it is obsolete
> ---
>
> Key: LUCENE-8369
> URL: https://issues.apache.org/jira/browse/LUCENE-8369
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/spatial
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Attachments: LUCENE-8369.patch
>
>
> The "spatial" module is at this juncture nearly empty with only a couple 
> utilities that aren't used by anything in the entire codebase -- 
> GeoRelationUtils, and MortonEncoder.  Perhaps it should have been removed 
> earlier in LUCENE-7664 which was the removal of GeoPointField which was 
> essentially why the module existed.  Better late than never.






[jira] [Created] (LUCENE-8947) Indexing fails with "too many tokens for field" when using custom term frequencies

2019-08-06 Thread Michael McCandless (JIRA)
Michael McCandless created LUCENE-8947:
--

 Summary: Indexing fails with "too many tokens for field" when 
using custom term frequencies
 Key: LUCENE-8947
 URL: https://issues.apache.org/jira/browse/LUCENE-8947
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 7.5
Reporter: Michael McCandless


We are using custom term frequencies (LUCENE-7854) to index per-token scoring 
signals, however for one document that had many tokens and those tokens had 
fairly large (~998,000) scoring signals, we hit this exception:
{noformat}
2019-08-05T21:32:37,048 [ERROR] (LuceneIndexing-3-thread-3) 
com.amazon.lucene.index.IndexGCRDocument: Failed to index doc: 
java.lang.IllegalArgumentException: too many tokens for field "foobar"
at 
org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:825)
at 
org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
at 
org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
at 
org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)
at 
org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)
at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)
at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)
{noformat}
This is happening in this code in {{DefaultIndexingChain.java}}:
{noformat}
  try {
invertState.length = Math.addExact(invertState.length, 
invertState.termFreqAttribute.getTermFrequency());
  } catch (ArithmeticException ae) {
throw new IllegalArgumentException("too many tokens for field \"" + 
field.name() + "\"");
  }{noformat}
Where Lucene is accumulating the total length (number of tokens) for the field. 
 But total length doesn't really make sense if you are using custom term 
frequencies to hold arbitrary scoring signals?  Or, maybe it does make sense, 
if the user is using this for simple boosting, but maybe we should allow this length 
to be a {{long}}?






[jira] [Commented] (LUCENE-8369) Remove the spatial module as it is obsolete

2019-07-25 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893187#comment-16893187
 ] 

Michael McCandless commented on LUCENE-8369:


{quote}Lots of awesome functionality _commonly_ needed in search is in our 
modules – like highlighting, autocomplete, and spellcheck, to name a few. Why 
should spatial be an exception?
{quote}
Well, other examples are default analysis ({{StandardAnalyzer}}), common 
queries (versus exotic queries in the queries module), most {{Directory}} 
implementations, where we have some common choices in core and more exotic 
choices in our modules.

I think it (the "common" classes and the "exotic" ones) is a helpful 
distinction for our users for areas that have many many options.

[~nknize] can you give a concrete example where the code sharing is making 
things difficult?  Can we simply make the necessary APIs public and marked 
{{@lucene.internal}} in our core spatial classes?

> Remove the spatial module as it is obsolete
> ---
>
> Key: LUCENE-8369
> URL: https://issues.apache.org/jira/browse/LUCENE-8369
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/spatial
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Attachments: LUCENE-8369.patch
>
>
> The "spatial" module is at this juncture nearly empty with only a couple 
> utilities that aren't used by anything in the entire codebase -- 
> GeoRelationUtils, and MortonEncoder.  Perhaps it should have been removed 
> earlier in LUCENE-7664 which was the removal of GeoPointField which was 
> essentially why the module existed.  Better late than never.






[jira] [Commented] (LUCENE-8865) Use incoming thread for execution if IndexSearcher has an executor

2019-07-22 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890429#comment-16890429
 ] 

Michael McCandless commented on LUCENE-8865:


Alas, I ran our internal benchmarks (production queries on production 
documents, measuring red-line QPS and long-pole query latencies at 10% 
capacity) and I could not measure any change due to this fix – it seems to be 
in the noise.

I was hoping for a small gain due to one fewer thread context switch ... but I 
still think the change is a good one!  Thanks [~simonw]!
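The pattern the issue describes can be sketched with plain {{ExecutorService}} code (illustrative names, with hit counting standing in for real per-slice collection): the calling thread keeps one slice for itself instead of idling while the executor runs the rest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical sketch: dispatch all slices but one to the executor, run the
// remaining slice on the incoming (calling) thread, then combine the results.
public class CallerRunsOneSlice {
    public static long search(List<long[]> slices, ExecutorService executor) {
        List<Future<Long>> futures = new ArrayList<>();
        for (int i = 1; i < slices.size(); i++) {      // slices 1..n-1 -> executor
            long[] slice = slices.get(i);
            futures.add(executor.submit(() -> countHits(slice)));
        }
        long total = countHits(slices.get(0));         // slice 0 on the caller
        for (Future<Long> f : futures) {
            try {
                total += f.get();
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        }
        return total;
    }

    private static long countHits(long[] docs) {
        return docs.length; // stand-in for per-slice collection work
    }
}
```

This saves one task hand-off per query; as the benchmark result above suggests, the win may be too small to measure, but it also costs nothing.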

>  Use incoming thread for execution if IndexSearcher has an executor
> ---
>
> Key: LUCENE-8865
> URL: https://issues.apache.org/jira/browse/LUCENE-8865
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Simon Willnauer
>Priority: Major
> Fix For: master (9.0), 8.2
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Today we don't utilize the incoming thread for a search when IndexSearcher
> has an executor. This thread is only idling but can be used to execute a 
> search
> once all other collectors are dispatched.






[jira] [Updated] (LUCENE-8884) Add Directory wrapper to track per-query IO counters

2019-07-18 Thread Michael McCandless (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-8884:
---
Attachment: LUCENE-8884.patch
Status: Open  (was: Open)

Trying again to attach the first-cut patch!

> Add Directory wrapper to track per-query IO counters
> 
>
> Key: LUCENE-8884
> URL: https://issues.apache.org/jira/browse/LUCENE-8884
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/store
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-8884.patch
>
>
> Lucene's IO abstractions ({{Directory, IndexInput/Output}}) make it really 
> easy to track counters of how many IOPs and net bytes are read for each 
> query, which is a useful metric to track/aggregate/alarm on in production or 
> dev benchmarks.
> At my day job we use these wrappers in our nightly benchmarks to catch any 
> accidental performance regressions.






[jira] [Commented] (LUCENE-8884) Add Directory wrapper to track per-query IO counters

2019-07-17 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887154#comment-16887154
 ] 

Michael McCandless commented on LUCENE-8884:


Argh!!  Not sure how I messed that up ... I’ll fix it once I have access to my 
laptop again.  Thanks for checking [~jpountz]!

> Add Directory wrapper to track per-query IO counters
> 
>
> Key: LUCENE-8884
> URL: https://issues.apache.org/jira/browse/LUCENE-8884
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/store
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
>
> Lucene's IO abstractions ({{Directory, IndexInput/Output}}) make it really 
> easy to track counters of how many IOPs and net bytes are read for each 
> query, which is a useful metric to track/aggregate/alarm on in production or 
> dev benchmarks.
> At my day job we use these wrappers in our nightly benchmarks to catch any 
> accidental performance regressions.






[jira] [Updated] (LUCENE-8884) Add Directory wrapper to track per-query IO counters

2019-07-16 Thread Michael McCandless (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-8884:
---
Status: Open  (was: Open)

Here's an initial patch, adding {{IOTrackingDirectoryWrapper}}.

Whenever a given thread is "working" on a particular query it must first call 
{{setQueryForThread}} so the wrapper knows which query's counters to increment.

It tracks the number of IOPs and the total number of bytes read.

It's likely it impacts search performance, so it should only be used during 
profiling/benchmarking.

> Add Directory wrapper to track per-query IO counters
> 
>
> Key: LUCENE-8884
> URL: https://issues.apache.org/jira/browse/LUCENE-8884
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/store
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
>
> Lucene's IO abstractions ({{Directory, IndexInput/Output}}) make it really 
> easy to track counters of how many IOPs and net bytes are read for each 
> query, which is a useful metric to track/aggregate/alarm on in production or 
> dev benchmarks.
> At my day job we use these wrappers in our nightly benchmarks to catch any 
> accidental performance regressions.



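The per-thread tracking idea in the patch above can be sketched without Lucene at all. The following is a hypothetical, simplified model (the name {{setQueryForThread}} mirrors the patch description; everything else is assumed): each reading thread first declares which query it is working on, and every low-level read then bumps that query's [IOPs, bytes] counters.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the idea behind IOTrackingDirectoryWrapper: a
// thread declares its current query, and reads increment that query's
// counters. Not the actual Lucene patch.
public class IOTracker {
    private static final ThreadLocal<String> QUERY = new ThreadLocal<>();
    private static final Map<String, long[]> COUNTERS = new ConcurrentHashMap<>();

    public static void setQueryForThread(String queryId) { QUERY.set(queryId); }

    // A real wrapper would call this from IndexInput read methods.
    public static void onRead(int bytesRead) {
        String q = QUERY.get();
        if (q == null) return;                           // untracked thread
        long[] c = COUNTERS.computeIfAbsent(q, k -> new long[2]);
        synchronized (c) { c[0]++; c[1] += bytesRead; }  // [iops, bytes]
    }

    public static long[] countersFor(String queryId) {
        long[] c = COUNTERS.getOrDefault(queryId, new long[2]);
        synchronized (c) { return new long[] { c[0], c[1] }; }
    }

    public static void main(String[] args) {
        setQueryForThread("q1");
        onRead(4096);
        onRead(512);
        long[] c = countersFor("q1");
        if (c[0] != 2 || c[1] != 4608) throw new AssertionError();
        System.out.println("q1: " + c[0] + " IOPs, " + c[1] + " bytes");
    }
}
```

The {{ThreadLocal}} makes the "which query does this read belong to" bookkeeping cheap, which is likely why a setter-per-thread API was chosen over passing the query down through every read call.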



[jira] [Assigned] (LUCENE-8884) Add Directory wrapper to track per-query IO counters

2019-07-16 Thread Michael McCandless (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-8884:
--

Assignee: Michael McCandless

> Add Directory wrapper to track per-query IO counters
> 
>
> Key: LUCENE-8884
> URL: https://issues.apache.org/jira/browse/LUCENE-8884
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/store
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
>
> Lucene's IO abstractions ({{Directory, IndexInput/Output}}) make it really 
> easy to track counters of how many IOPs and net bytes are read for each 
> query, which is a useful metric to track/aggregate/alarm on in production or 
> dev benchmarks.
> At my day job we use these wrappers in our nightly benchmarks to catch any 
> accidental performance regressions.






[jira] [Commented] (LUCENE-8069) Allow index sorting by field length

2019-07-09 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881211#comment-16881211
 ] 

Michael McCandless commented on LUCENE-8069:


+1, results are impressive.

> Allow index sorting by field length
> ---
>
> Key: LUCENE-8069
> URL: https://issues.apache.org/jira/browse/LUCENE-8069
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Adrien Grand
>Priority: Minor
>
> Short documents are more likely to get higher scores, so sorting an index by 
> field length would mean we would be likely to collect best matches first. 
> Depending on the similarity implementation, this might even allow early 
> termination of top-document collection on term queries.






[jira] [Commented] (LUCENE-8311) Leverage impacts for phrase queries

2019-07-09 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881203#comment-16881203
 ] 

Michael McCandless commented on LUCENE-8311:


+1 to merge ... that is a good tradeoff!  Astronomical speedups for 
{{PhraseQuery}} and some small slowdowns in others.  It's important that all of 
our common queries properly handle impacts.

> Leverage impacts for phrase queries
> ---
>
> Key: LUCENE-8311
> URL: https://issues.apache.org/jira/browse/LUCENE-8311
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8311.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Now that we expose raw impacts, we could leverage them for phrase queries.
> For instance for exact phrases, we could take the minimum term frequency for 
> each unique norm value in order to get upper bounds of the score for the 
> phrase.






[jira] [Commented] (LUCENE-4312) Index format to store position length per position

2019-07-09 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881195#comment-16881195
 ] 

Michael McCandless commented on LUCENE-4312:


+1 to build on payloads, today, to break the chicken/egg situation.  This 
should just be a {{TokenFilter}} that converts {{PositionLengthAttribute}} into 
payloads?  Then [~mgibney] could contribute query-time code that can decode 
these payloads and implement correct positional queries.  Once these prove 
useful we can circle back later and optimize how we store position lengths in 
the index.

> Index format to store position length per position
> --
>
> Key: LUCENE-4312
> URL: https://issues.apache.org/jira/browse/LUCENE-4312
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 6.0
>Reporter: Gang Luo
>Priority: Minor
>  Labels: Suggestion
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs, but the indexer 
> ignores PositionLengthAttribute. We need to change the index format (and 
> Codec APIs) to store an additional int position length per position.



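To make the payload-based workaround concrete, here is a minimal, self-contained sketch (plain JDK; the class name is hypothetical) of encoding a token's position length as a vInt payload, the kind of compact encoding a TokenFilter converting {{PositionLengthAttribute}} into payloads could use. The Lucene attribute/filter plumbing itself is omitted.

```java
// Hypothetical vInt codec for carrying a position length in a payload:
// the common case (posLen == 1) costs a single byte.
public class PosLenPayload {
    // Encode at index time (e.g. inside the TokenFilter).
    public static byte[] encode(int posLen) {
        byte[] buf = new byte[5];
        int i = 0;
        while ((posLen & ~0x7F) != 0) {      // emit 7 bits at a time,
            buf[i++] = (byte) ((posLen & 0x7F) | 0x80); // high bit = "more"
            posLen >>>= 7;
        }
        buf[i++] = (byte) posLen;
        return java.util.Arrays.copyOf(buf, i);
    }

    // Decode at query time from the stored payload bytes.
    public static int decode(byte[] payload) {
        int value = 0, shift = 0, i = 0;
        byte b;
        do {
            b = payload[i++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    public static void main(String[] args) {
        if (decode(encode(1)) != 1) throw new AssertionError();
        if (decode(encode(300)) != 300) throw new AssertionError();
        System.out.println("posLen=300 encodes to " + encode(300).length + " bytes");
    }
}
```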



[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-07-09 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881120#comment-16881120
 ] 

Michael McCandless commented on LUCENE-8753:


+1 to push this new codec in Lucene, e.g. under codecs or sandbox or misc 
modules, if we can avoid making changes to other sources (once LUCENE-8906 is 
fixed).

> New PostingFormat - UniformSplit
> 
>
> Key: LUCENE-8753
> URL: https://issues.apache.org/jira/browse/LUCENE-8753
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.0
>Reporter: Bruno Roustant
>Assignee: David Smiley
>Priority: Major
> Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 
> objectives:
>  - Clear design and simple code.
>  - Easily extensible, for both the logic and the index format.
>  - Light memory usage with a very compact FST.
>  - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (the attached pdf explains the technique visually in more detail)
>  The principle is to split the list of terms into blocks and use an FST to 
> access the block, but not as a prefix trie, rather with a seek-floor pattern. 
> For the selection of the blocks, there is a target average block size (number 
> of terms), with an allowed delta variation (10%) to compare the terms and 
> select the one with the minimal distinguishing prefix.
>  There are also several optimizations inside the block to make it more 
> compact and speed up the loading/scanning.
> The performance obtained is interesting with the luceneutil benchmark, 
> comparing UniformSplit with BlockTree. Find it in the first comment and also 
> attached for better formatting.
> Although the precise percentages vary between runs, three main points:
>  - TermQuery and PhraseQuery are improved.
>  - PrefixQuery and WildcardQuery are ok.
>  - Fuzzy queries are clearly less performant, because BlockTree is so 
> optimized for them.
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time 
> is reduced by 20%. So this PostingsFormat scales to lots of docs, like 
> BlockTree does.
> This initial version passes all Lucene tests. Use “ant test 
> -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.
> Subjectively, we think we have fulfilled our goal of code simplicity. And we 
> have already exercised this PostingsFormat extensibility to create a 
> different flavor for our own use-case.
> Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley






[jira] [Commented] (LUCENE-8906) Lucene50PostingsReader.postings() casts BlockTermState param to private IntBlockTermState

2019-07-09 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881109#comment-16881109
 ] 

Michael McCandless commented on LUCENE-8906:


+1 to simply make {{IntBlockTermState}} public.

> Lucene50PostingsReader.postings() casts BlockTermState param to private 
> IntBlockTermState
> -
>
> Key: LUCENE-8906
> URL: https://issues.apache.org/jira/browse/LUCENE-8906
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Bruno Roustant
>Priority: Major
>
> Lucene50PostingsReader is the public API that offers the postings() method to 
> read the postings. Any PostingFormat can use it (as well as 
> Lucene50PostingsWriter) to read/write postings.
> But the postings() method asks for a (public) BlockTermState param which is 
> internally cast to the private IntBlockTermState. This BlockTermState is 
> provided by Lucene50PostingsReader.newTermState().
> public PostingsEnum postings(FieldInfo fieldInfo, BlockTermState termState, 
> PostingsEnum reuse, int flags)
> This actually makes it impossible for a custom PostingsFormat that customizes 
> the block file structure to use this postings() method by providing its own 
> (Int)BlockTermState, because it cannot access the FP fields of the 
> IntBlockTermState returned by PostingsReaderBase.newTermState().
> Proposed change:
>  * Either make IntBlockTermState public, as well as its fields.
>  * Or replace it by an interface in the postings() method. In this case the 
> IntBlockTermState fields currently accessed directly would be replaced by 
> getter/setter.






[jira] [Commented] (LUCENE-8878) Provide alternative sorting utility from SortField other than FieldComparator

2019-07-05 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879274#comment-16879274
 ] 

Michael McCandless commented on LUCENE-8878:


{quote}I believe you are talking about Scorer#setMinCompetitiveScore, ie. 
changing the FieldComparator API to only track the bottom bucket as opposed to 
every bucket? If this is the case I agree that it sounds like a good idea to 
explore.
{quote}
Ahh, yes, that ;)  +1

> Provide alternative sorting utility from SortField other than FieldComparator
> -
>
> Key: LUCENE-8878
> URL: https://issues.apache.org/jira/browse/LUCENE-8878
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: 8.1.1
>Reporter: Tony Xu
>Priority: Major
>
> The `FieldComparator` has many responsibilities and users get all of them at 
> once. At high level the main functionalities of `FieldComparator` are
>  * Provide LeafFieldComparator
>  * Allocate storage for requested number of hits
>  * Read the values from DocValues/Custom source etc.
>  * Compare two values
> There are three major areas for improvement
>  # The logic of reading values and storing them is coupled.
>  # Users need to specify the size in order to create a `FieldComparator` but 
> sometimes the size is unknown upfront.
>  # From `FieldComparator`'s API, one can't reason about thread-safety so it 
> is not suitable for concurrent search.
>  E.g. can two concurrent threads use the same `FieldComparator` to call 
> `getLeafComparator` for two different segments they are working on? In fact, 
> almost all existing implementations of `FieldComparator` are not thread-safe.
> The proposal is to enhance `SortField` with two APIs
>  # {color:#14892c}int compare(Object v1, Object v2){color} – this is to 
> compare two values from different docs for this field
>  # {color:#14892c}ValueAccessor newValueAccessor(LeafReaderContext 
> leaf){color} – This encapsulates the logic for obtaining the right 
> implementation in order to read the field values.
>  `ValueAccessor` should be accessed in a similar way as `DocValues` to 
> provide the sort value for a document in an advance & read fashion.
> With this API, hopefully we can reduce the memory usage when using 
> `FieldComparator` because the users either store the sort values or at least 
> the slot number in addition to the storage allocated by `FieldComparator` 
> itself. Ideally, only one copy of the values should be stored.
> The proposed API is also more friendly to concurrent search since it provides 
> the `ValueAccessor` per leaf. Although same `ValueAccessor` can't be shared 
> if there are more than one thread working on the same leaf, at least they can 
> initialize their own `ValueAccessor`.



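As a rough illustration of the proposed split between value access and comparison, here is a self-contained sketch (a toy in-memory "leaf"; the names mirror the proposal in the quoted description, not an actual Lucene API): a per-leaf ValueAccessor that is advanced and read like DocValues, plus a stateless compare that is trivially thread-safe.

```java
import java.util.Arrays;

// Hypothetical model of the proposed SortField APIs: newValueAccessor(leaf)
// for reading per-segment values, and a stateless compare over those values.
public class ValueAccessorSketch {
    interface ValueAccessor {
        boolean advanceExact(int docId); // position on a doc, DocValues-style
        long longValue();                // then read its sort value
    }

    // Toy per-leaf accessor backed by an array; a real one would wrap the
    // segment's doc values. Each thread can create its own for a leaf.
    static ValueAccessor newValueAccessor(long[] leafValues) {
        return new ValueAccessor() {
            private int doc = -1;
            public boolean advanceExact(int docId) {
                doc = docId;
                return docId >= 0 && docId < leafValues.length;
            }
            public long longValue() { return leafValues[doc]; }
        };
    }

    // Stateless compare: unlike FieldComparator, it hides no slot storage,
    // so thread-safety is obvious.
    static int compare(long v1, long v2) { return Long.compare(v1, v2); }

    public static void main(String[] args) {
        ValueAccessor acc = newValueAccessor(new long[] {42, 7, 19});
        Long[] collected = new Long[3];
        for (int doc = 0; doc < 3; doc++) {
            if (acc.advanceExact(doc)) collected[doc] = acc.longValue();
        }
        Arrays.sort(collected, (a, b) -> compare(a, b));
        System.out.println(Arrays.toString(collected)); // [7, 19, 42]
    }
}
```

The point of the split is that the caller owns the storage for collected values, so only one copy needs to exist regardless of queue size.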



[jira] [Commented] (LUCENE-8781) Explore FST direct array arc encoding

2019-06-30 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875903#comment-16875903
 ] 

Michael McCandless commented on LUCENE-8781:


+1 to do the simple approach even if it costs a little performance, and to 
delete the unused method in {{oal.util.fst.Util}}.

This is an experimental codec that implements an optional terms dict API that 
assigns a {{long}} ordinal to each term.

> Explore FST direct array arc encoding 
> --
>
> Key: LUCENE-8781
> URL: https://issues.apache.org/jira/browse/LUCENE-8781
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: FST-2-4.png, FST-6-9.png, FST-size.png
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> This issue is for exploring an alternate FST encoding of Arcs as full-sized 
> arrays so Arcs are addressed directly by label, avoiding binary search that 
> we use today for arrays of Arcs. PR: 
> https://github.com/apache/lucene-solr/pull/657
> h3. Testing
> ant test passes. I added some unit tests that were helpful in uncovering bugs 
> while
> implementing which are more difficult to chase down when uncovered by the 
> randomized testing we already do. They don't really test anything new; 
> they're just more focused.
> I'm not sure why, but ant precommit failed for me with:
> {noformat}
>  ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls 
> failed while scanning class 
> 'org.apache.solr.metrics.reporters.SolrGangliaReporterTest' 
> (SolrGangliaReporterTest.java): java.lang.ClassNotFoundException: 
> info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about 
> referenced class 'info.ganglia.gmetric4j.gmetric.GMetric')
> {noformat}
> I also got Test2BFST running (it was originally timing out due to excessive 
> calls to ramBytesUsage(), which seems to have gotten slow), and it passed; 
> that change isn't include here.
> h4. Micro-benchmark
> I timed lookups in FST via FSTEnum.seekExact in a unit test under various 
> conditions. 
> h5. English words
> A test of looking up existing words in a dictionary of ~17 English words 
> shows improvements; the numbers listed are % change in FST size, time to look 
> up (FSTEnum.seekExact) words that are in the dict, and time to look up random 
> strings that are not in the dict. The comparison is against the current 
> codebase with the optimization disabled. A separate comparison of showed no 
> significant change of the baseline (no opto applied) vs the current master 
> FST impl with no code changes applied.
> ||  load=2||   load=4 ||  load=16 ||
> | +4, -6, -7  | +18, -11, -8 | +22, -11.5, -7 |
> The "load factor" used for those measurements controls when direct array arc 
> encoding is used;
> namely when the number of outgoing arcs was > load * (max label - min label).
> h5. sequential and random terms
> The same test, with terms being a sequence of integers as strings shows a 
> larger improvement, around 20% (load=4). This is presumably the best case for 
> this delta, where every Arc is encoded as a direct lookup.
> When random lowercase ASCII strings are used, a smaller improvement of around 
> 4% is seen.
> h4. luceneutil
> Testing w/luceneutil (wikimediumall) we see improvements mostly in the 
> PKLookup case. Other results seem noisy, with perhaps a small improvment in 
> some of the queries.
> {noformat}
> TaskQPS base  StdDevQPS opto  StdDev  
>   Pct diff
>   OrHighHigh6.93  (3.0%)6.89  (3.1%)   
> -0.5% (  -6% -5%)
>OrHighMed   45.15  (3.9%)   44.92  (3.5%)   
> -0.5% (  -7% -7%)
> Wildcard8.72  (4.7%)8.69  (4.6%)   
> -0.4% (  -9% -9%)
>   AndHighLow  274.11  (2.6%)  273.58  (3.1%)   
> -0.2% (  -5% -5%)
>OrHighLow  241.41  (1.9%)  241.11  (3.5%)   
> -0.1% (  -5% -5%)
>   AndHighMed   52.23  (4.1%)   52.41  (5.3%)
> 0.3% (  -8% -   10%)
>  MedTerm 1026.24  (3.1%) 1030.52  (4.3%)
> 0.4% (  -6% -8%)
> HighTerm .10  (3.4%) 1116.70  (4.0%)
> 0.5% (  -6% -8%)
>HighTermDayOfYearSort   14.59  (8.2%)   14.73  (9.3%)
> 1.0% ( -15% -   20%)
>  AndHighHigh   13.45  (6.2%)   13.61  (4.4%)
> 1.2% (  -8% -   12%)
>HighTermMonthSort   63.09 (12.5%)   64.13 (10.9%)
> 1.6% ( -19% -   28%)
>  LowTerm 1338.94  (3.3%) 1383.90  (5
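The trade-off being measured above can be sketched in isolation: binary search over the sorted labels of a node's arcs versus a direct-addressed table spanning [minLabel, maxLabel]. This is a simplified, hypothetical model of the encoding, not the actual FST code.

```java
import java.util.Arrays;

// Hypothetical model of the two arc encodings: a sorted label array
// resolved by binary search vs. a direct table with O(1) lookup that
// spends (maxLabel - minLabel + 1) slots, holes marked -1.
public class DirectArcLookup {
    // Today's encoding: only present labels are stored.
    static int binarySearchArc(int[] sortedLabels, int label) {
        return Arrays.binarySearch(sortedLabels, label); // < 0 if absent
    }

    // Direct encoding: table[label - minLabel] holds the arc index.
    static int[] buildDirect(int[] sortedLabels, int minLabel, int maxLabel) {
        int[] table = new int[maxLabel - minLabel + 1];
        Arrays.fill(table, -1);
        for (int arc = 0; arc < sortedLabels.length; arc++)
            table[sortedLabels[arc] - minLabel] = arc;
        return table;
    }

    static int directArc(int[] table, int minLabel, int label) {
        int idx = label - minLabel;
        return (idx < 0 || idx >= table.length) ? -1 : table[idx];
    }

    public static void main(String[] args) {
        int[] labels = {'a', 'c', 'z'};
        int[] table = buildDirect(labels, 'a', 'z');
        if (directArc(table, 'a', 'c') != 1) throw new AssertionError();
        if (directArc(table, 'a', 'b') != -1) throw new AssertionError();
        // This is the kind of sparse node (3 arcs over 26 slots) that the
        // patch's load factor would keep on binary search.
        System.out.println("slots=" + table.length + ", arcs=" + labels.length);
    }
}
```

The load factor discussed in the benchmark gates exactly this size-for-speed trade: dense nodes get the table, sparse ones keep binary search.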

[jira] [Commented] (LUCENE-8878) Provide alternative sorting utility from SortField other than FieldComparator

2019-06-27 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874573#comment-16874573
 ] 

Michael McCandless commented on LUCENE-8878:


The recently added impacts have a similar use case, where we need to express to 
the {{ImpactsEnum}} what the "bottom" of our PQ is, I think?  Maybe we could 
take inspiration from that to simplify the comparator APIs or make them similar 
to how {{ImpactsEnum}} does it?

> Provide alternative sorting utility from SortField other than FieldComparator
> -
>
> Key: LUCENE-8878
> URL: https://issues.apache.org/jira/browse/LUCENE-8878
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: 8.1.1
>Reporter: Tony Xu
>Priority: Major
>
> The `FieldComparator` has many responsibilities and users get all of them at 
> once. At high level the main functionalities of `FieldComparator` are
>  * Provide LeafFieldComparator
>  * Allocate storage for requested number of hits
>  * Read the values from DocValues/Custom source etc.
>  * Compare two values
> There are three major areas for improvement
>  # The logic of reading values and storing them is coupled.
>  # Users need to specify the size in order to create a `FieldComparator` but 
> sometimes the size is unknown upfront.
>  # From `FieldComparator`'s API, one can't reason about thread-safety so it 
> is not suitable for concurrent search.
>  E.g. can two concurrent threads use the same `FieldComparator` to call 
> `getLeafComparator` for two different segments they are working on? In fact, 
> almost all existing implementations of `FieldComparator` are not thread-safe.
> The proposal is to enhance `SortField` with two APIs
>  # {color:#14892c}int compare(Object v1, Object v2){color} – this is to 
> compare two values from different docs for this field
>  # {color:#14892c}ValueAccessor newValueAccessor(LeafReaderContext 
> leaf){color} – This encapsulates the logic for obtaining the right 
> implementation in order to read the field values.
>  `ValueAccessor` should be accessed in a similar way as `DocValues` to 
> provide the sort value for a document in an advance & read fashion.
> With this API, hopefully we can reduce the memory usage when using 
> `FieldComparator` because the users either store the sort values or at least 
> the slot number in addition to the storage allocated by `FieldComparator` 
> itself. Ideally, only one copy of the values should be stored.
> The proposed API is also more friendly to concurrent search since it provides 
> the `ValueAccessor` per leaf. Although same `ValueAccessor` can't be shared 
> if there are more than one thread working on the same leaf, at least they can 
> initialize their own `ValueAccessor`.






[jira] [Created] (LUCENE-8884) Add Directory wrapper to track per-query IO counters

2019-06-26 Thread Michael McCandless (JIRA)
Michael McCandless created LUCENE-8884:
--

 Summary: Add Directory wrapper to track per-query IO counters
 Key: LUCENE-8884
 URL: https://issues.apache.org/jira/browse/LUCENE-8884
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/store
Reporter: Michael McCandless


Lucene's IO abstractions ({{Directory, IndexInput/Output}}) make it really easy 
to track counters of how many IOPs and net bytes are read for each query, which 
is a useful metric to track/aggregate/alarm on in production or dev benchmarks.

At my day job we use these wrappers in our nightly benchmarks to catch any 
accidental performance regressions.






[jira] [Commented] (LUCENE-8865) Use incoming thread for execution if IndexSearcher has an executor

2019-06-26 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873293#comment-16873293
 ] 

Michael McCandless commented on LUCENE-8865:


I plan to test this using our production benchmarks ... will try to do that 
soon.

>  Use incoming thread for execution if IndexSearcher has an executor
> ---
>
> Key: LUCENE-8865
> URL: https://issues.apache.org/jira/browse/LUCENE-8865
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Simon Willnauer
>Priority: Major
> Fix For: master (9.0), 8.2
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Today we don't utilize the incoming thread for a search when IndexSearcher
> has an executor. This thread is only idling but can be used to execute a 
> search
> once all other collectors are dispatched.






[jira] [Commented] (LUCENE-8878) Provide alternative sorting utility from SortField other than FieldComparator

2019-06-25 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872270#comment-16872270
 ] 

Michael McCandless commented on LUCENE-8878:


+1 to simplify Lucene's comparator APIs – they are crazy complicated because 
they are "hiding" a priority queue underneath.  They look nothing like you'd 
expect a comparator to look like!  They were designed this way to sometimes 
enable int ordinal comparisons when sorting by string fields 
({{DocValuesType.SORTED}}) but I'm not sure all that API complexity is really 
worth the performance.

To access the values can we somehow use the existing {{FunctionValues}} classes?

> Provide alternative sorting utility from SortField other than FieldComparator
> -
>
> Key: LUCENE-8878
> URL: https://issues.apache.org/jira/browse/LUCENE-8878
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: 8.1.1
>Reporter: Tony Xu
>Priority: Major
>
> The `FieldComparator` has many responsibilities and users get all of them at 
> once. At high level the main functionalities of `FieldComparator` are
>  * Provide LeafFieldComparator
>  * Allocate storage for requested number of hits
>  * Read the values from DocValues/Custom source etc.
>  * Compare two values
> There are three major areas for improvement
>  # The logic of reading values and storing them is coupled.
>  # Users need to specify the size in order to create a `FieldComparator` but 
> sometimes the size is unknown upfront.
>  # From `FieldComparator`'s API, one can't reason about thread-safety so it 
> is not suitable for concurrent search.
>  E.g. can two concurrent threads use the same `FieldComparator` to call 
> `getLeafComparator` for two different segments they are working on? In fact, 
> almost all existing implementations of `FieldComparator` are not thread-safe.
> The proposal is to enhance `SortField` with two APIs
>  # {color:#14892c}int compare(Object v1, Object v2){color} – this is to 
> compare two values from different docs for this field
>  # {color:#14892c}ValueAccessor newValueAccessor(LeafReaderContext 
> leaf){color} – This encapsulates the logic for obtaining the right 
> implementation in order to read the field values.
>  `ValueAccessor` should be accessed in a similar way as `DocValues` to 
> provide the sort value for a document in an advance & read fashion.
> With this API, hopefully we can reduce the memory usage when using 
> `FieldComparator` because the users either store the sort values or at least 
> the slot number in addition to the storage allocated by `FieldComparator` 
> itself. Ideally, only one copy of the values should be stored.
> The proposed API is also more friendly to concurrent search since it provides 
> the `ValueAccessor` per leaf. Although same `ValueAccessor` can't be shared 
> if there are more than one thread working on the same leaf, at least they can 
> initialize their own `ValueAccessor`.






[jira] [Commented] (LUCENE-8867) Optimise BKD tree for low cardinality leaves

2019-06-18 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867256#comment-16867256
 ] 

Michael McCandless commented on LUCENE-8867:


+1 to both of these optimizations – I suspect many use cases will have such 
duplicate values and we could see a big reduction in index usage for the leaf 
blocks, and a speedup if we do the comparison once per unique value instead of 
once per value.

> Optimise BKD tree for low cardinality leaves
> 
>
> Key: LUCENE-8867
> URL: https://issues.apache.org/jira/browse/LUCENE-8867
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently if a leaf on the BKD tree contains only a few distinct values, the 
> leaf is treated the same way as if all values were different. In many cases 
> it can be much more efficient to store the distinct values along with their 
> cardinality.
> In addition, in this case the method IntersectVisitor#visit(docId, byte[]) is 
> called n times with the same byte array but different docIDs. This issue 
> proposes to add a new method to the interface that accepts an array of docs 
> so it can be overridden by implementors to gain search performance.



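The two optimizations can be sketched together in a toy model (hypothetical names; real BKD leaves store packed byte values, not ints): each distinct value is stored once with its run of docIDs, and a range visit then costs one comparison per distinct value instead of one per point.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a low-cardinality BKD leaf: runs of (value, docs)
// instead of one (value, doc) pair per point.
public class LowCardinalityLeaf {
    record Run(int value, int[] docs) {}

    // Points within a leaf arrive sorted by value, so duplicates are adjacent.
    static List<Run> encode(int[] values, int[] docs) {
        List<Run> runs = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= values.length; i++) {
            if (i == values.length || values[i] != values[start]) {
                runs.add(new Run(values[start],
                        java.util.Arrays.copyOfRange(docs, start, i)));
                start = i;
            }
        }
        return runs;
    }

    // Analogous to the proposed bulk visit(docIDs[], value): the range check
    // runs once per run, then the whole doc array is accepted or skipped.
    static int countMatching(List<Run> runs, int min, int max) {
        int matches = 0, comparisons = 0;
        for (Run r : runs) {
            comparisons++;
            if (r.value() >= min && r.value() <= max) matches += r.docs().length;
        }
        System.out.println("comparisons=" + comparisons);
        return matches;
    }

    public static void main(String[] args) {
        int[] values = {3, 3, 3, 7, 7, 9};
        int[] docs   = {0, 4, 9, 1, 2, 5};
        List<Run> runs = encode(values, docs);
        if (runs.size() != 3) throw new AssertionError();
        if (countMatching(runs, 3, 7) != 5) throw new AssertionError(); // 3 comparisons, not 6
    }
}
```

With 6 points but 3 distinct values, the sketch does 3 comparisons instead of 6; the savings grow with the duplication factor, which is the effect the comment above anticipates.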



[jira] [Commented] (LUCENE-8854) Can we do "doc at a time scoring" from the BKD tree for exact queries?

2019-06-13 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862929#comment-16862929
 ] 

Michael McCandless commented on LUCENE-8854:


{quote}moving points from a visitor API to a more cursor-style API that would 
allow us to walk freely the index of the KD tree.
{quote}
+1, that would enable exactly this kind of optimization.

Maybe, it's an optional way to consume/walk the BKD tree that applies in only 
certain situations.

> Can we do "doc at a time scoring" from the BKD tree for exact queries?
> --
>
> Key: LUCENE-8854
> URL: https://issues.apache.org/jira/browse/LUCENE-8854
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> Random idea: normally our point queries must walk the BKD tree, building up a 
> sparse or dense bitset as a 1st pass, then in 2nd pass run the "normal" query 
> scorers (postings, doc values), because the docids coming out across leaf 
> blocks are not in docid order, unlike postings and doc values.
> But, if the query is an exact point query, I think we tie break our within 
> leaf block sorts by docid, and that'd even apply across multiple leaf blocks 
> (if that value occurs enough times) and so for that case we could avoid the 2 
> passes and do it all in one pass maybe?






[jira] [Created] (LUCENE-8854) Can we do "doc at a time scoring" from the BKD tree for exact queries?

2019-06-12 Thread Michael McCandless (JIRA)
Michael McCandless created LUCENE-8854:
--

 Summary: Can we do "doc at a time scoring" from the BKD tree for 
exact queries?
 Key: LUCENE-8854
 URL: https://issues.apache.org/jira/browse/LUCENE-8854
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless


Random idea: normally our point queries must walk the BKD tree, building up a 
sparse or dense bitset as a 1st pass, then in 2nd pass run the "normal" query 
scorers (postings, doc values), because the docids coming out across leaf 
blocks are not in docid order, unlike postings and doc values.

But, if the query is an exact point query, I think we tie break our within leaf 
block sorts by docid, and that'd even apply across multiple leaf blocks (if 
that value occurs enough times) and so for that case we could avoid the 2 
passes and do it all in one pass maybe?






[jira] [Commented] (LUCENE-8791) Add CollectorRescorer

2019-06-10 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16860335#comment-16860335
 ] 

Michael McCandless commented on LUCENE-8791:


{quote}the default interface takes an ExecutorManager.
{quote}
Hmm did you mean {{ExecutorService}} or {{CollectorManager}} here?  There is no 
{{ExecutorManger}} that I can see.

+1 to mark the {{ExecutorService}} ctor/setter as expert w/ javadocs that 
explain that it is not often needed to distribute collection work across 
concurrent threads.

> Add CollectorRescorer
> -
>
> Key: LUCENE-8791
> URL: https://issues.apache.org/jira/browse/LUCENE-8791
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Elbek Kamoliddinov
>Priority: Major
> Attachments: LUCENE-8791.patch, LUCENE-8791.patch, LUCENE-8791.patch, 
> LUCENE-8791.patch, LUCENE-8791.patch
>
>
> This is another implementation of query rescorer api (LUCENE-5489). It adds 
> rescoring functionality based on provided CollectorManager. 
>  






[jira] [Created] (LUCENE-8823) IllegalStateException: wrong number of values added during doc values merge

2019-06-03 Thread Michael McCandless (JIRA)
Michael McCandless created LUCENE-8823:
--

 Summary: IllegalStateException: wrong number of values added 
during doc values merge
 Key: LUCENE-8823
 URL: https://issues.apache.org/jira/browse/LUCENE-8823
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 7.6
Reporter: Michael McCandless


Here's another mysterious exception we hit in production, on a Lucene 7.x 
snapshot release (near 7.6), OpenJDK 11:
{noformat}
2019-05-31T05:49:22,443 [ERROR] (Lucene Merge Thread #0) 
com.amazon.lucene.util.UncaughtExceptionHandler: Uncaught exception: 
org.apache.lucene.index.MergePolicy$MergeException: 
java.lang.IllegalStateException: Wrong number of values added, expected: 97006, 
got: 95784 in thread Thread[Lucene Merge Thread #0,5,main] 
org.apache.lucene.index.MergePolicy$MergeException: 
java.lang.IllegalStateException: Wrong number of values added, expected: 97006, 
got: 95784
at 
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:704)
at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:684)
Caused by: java.lang.IllegalStateException: Wrong number of values added, 
expected: 97006, got: 95784
at org.apache.lucene.util.packed.DirectWriter.finish(DirectWriter.java:94)
at 
org.apache.lucene.codecs.lucene70.Lucene70DocValuesConsumer.writeValuesSingleBlock(Lucene70DocValuesConsumer.java:283)
at 
org.apache.lucene.codecs.lucene70.Lucene70DocValuesConsumer.writeValues(Lucene70DocValuesConsumer.java:263)
at 
org.apache.lucene.codecs.lucene70.Lucene70DocValuesConsumer.addNumericField(Lucene70DocValuesConsumer.java:110)
at 
org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:175)
at org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:135)
at 
org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:151)
at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:182)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:126)
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4438)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4060)
at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625)
at 
com.amazon.lucene.index.ConcurrentMergeSchedulerWrapper.doMerge(ConcurrentMergeSchedulerWrapper.java:54)
at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662){noformat}
Merging of a numeric doc values field failed because too few values were added. 
 This may also be a JVM bug, though our doc values codec code is quite complex 
so it could also be a Lucene bug!






[jira] [Created] (LUCENE-8822) UnsupportedOperationException: unused: not a comparison-based sort during IndexWriter flush

2019-06-03 Thread Michael McCandless (JIRA)
Michael McCandless created LUCENE-8822:
--

 Summary: UnsupportedOperationException: unused: not a 
comparison-based sort during IndexWriter flush
 Key: LUCENE-8822
 URL: https://issues.apache.org/jira/browse/LUCENE-8822
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 7.6
Reporter: Michael McCandless


We hit this very strange exception in production on a 7.x snapshot (near 7.6), 
OpenJDK 11:
{noformat}
Caused by: java.lang.UnsupportedOperationException: unused: not a 
comparison-based sort
at org.apache.lucene.util.MSBRadixSorter.compare(MSBRadixSorter.java:115)
at org.apache.lucene.util.Sorter.siftDown(Sorter.java:235)
at org.apache.lucene.util.Sorter.heapify(Sorter.java:228)
at 
org.apache.lucene.util.MSBRadixSorter.computeCommonPrefixLengthAndBuildHistogram(MSBRadixSorter.java:209)
at org.apache.lucene.util.MSBRadixSorter.radixSort(MSBRadixSorter.java:148)
at org.apache.lucene.util.MSBRadixSorter.radixSort(MSBRadixSorter.java:155)
at org.apache.lucene.util.MSBRadixSorter.sort(MSBRadixSorter.java:128)
at org.apache.lucene.util.MSBRadixSorter.sort(MSBRadixSorter.java:121)
at 
org.apache.lucene.util.bkd.MutablePointsReaderUtils.sort(MutablePointsReaderUtils.java:90)
at org.apache.lucene.util.bkd.BKDWriter.writeField1Dim(BKDWriter.java:497)
at org.apache.lucene.util.bkd.BKDWriter.writeField(BKDWriter.java:427)
at 
org.apache.lucene.codecs.lucene60.Lucene60PointsWriter.writeField(Lucene60PointsWriter.java:105)
at org.apache.lucene.index.PointValuesWriter.flush(PointValuesWriter.java:183)
at 
org.apache.lucene.index.DefaultIndexingChain.writePoints(DefaultIndexingChain.java:206)
at 
org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:141)
at 
org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:470)
at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:554)
at 
org.apache.lucene.index.DocumentsWriter.flushOneDWPT(DocumentsWriter.java:257)
at org.apache.lucene.index.IndexWriter.flushNextBuffer(IndexWriter.java:3157)
at com.amazon.lucene.index.Indexer.lambda$commit$0(Indexer.java:1129){noformat}
The exception makes no sense to me: when I look at 
{{MSBRadixSorter.computeCommonPrefixLengthAndBuildHistogram}} at that line it 
does NOT invoke {{Sorter.heapify}} so I'm mystified.  Maybe this is a JVM bug 
...

 






[jira] [Commented] (LUCENE-8791) Add CollectorRescorer

2019-05-30 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852068#comment-16852068
 ] 

Michael McCandless commented on LUCENE-8791:


{quote}In my opinion, rescorers should be used on the very top hits only.
{quote}
Is the concern that this rescorer optionally accepts {{ExecutorService}} to 
distribute the work across threads?

Maybe we could just add another ctor *not* taking {{ExecutorService}} for those 
use cases that want to run single-threaded?
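A minimal sketch of that constructor shape in plain Java (the class and method names are hypothetical; this is not the actual {{CollectorRescorer}} API from the patch): the expert ctor takes an {{ExecutorService}}, and a convenience ctor runs everything in the caller thread.

```java
import java.util.concurrent.ExecutorService;

public class CollectorRescorerSketch {
  private final ExecutorService executor; // null means rescore in the caller thread

  /** Expert: distribute rescoring work across the given executor's threads. */
  public CollectorRescorerSketch(ExecutorService executor) {
    this.executor = executor;
  }

  /** Convenience ctor for the common single-threaded case. */
  public CollectorRescorerSketch() {
    this((ExecutorService) null);
  }

  /** True if rescoring would be distributed across threads. */
  public boolean isConcurrent() {
    return executor != null;
  }

  public static void main(String[] args) {
    System.out.println(new CollectorRescorerSketch().isConcurrent());
  }
}
```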

 

> Add CollectorRescorer
> -
>
> Key: LUCENE-8791
> URL: https://issues.apache.org/jira/browse/LUCENE-8791
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Elbek Kamoliddinov
>Priority: Major
> Attachments: LUCENE-8791.patch, LUCENE-8791.patch, LUCENE-8791.patch
>
>
> This is another implementation of query rescorer api (LUCENE-5489). It adds 
> rescoring functionality based on provided CollectorManager. 
>  






[jira] [Commented] (LUCENE-8757) Better Segment To Thread Mapping Algorithm

2019-05-20 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844005#comment-16844005
 ] 

Michael McCandless commented on LUCENE-8757:


{quote}Your last patch sorts in reverse order of docBase, it should sort by the 
natural order?
{quote}
Hmm can we add a test case or an assertion somewhere that would fail if this 
happens again in the future?

> Better Segment To Thread Mapping Algorithm
> --
>
> Key: LUCENE-8757
> URL: https://issues.apache.org/jira/browse/LUCENE-8757
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Assignee: Simon Willnauer
>Priority: Major
> Attachments: LUCENE-8757.patch, LUCENE-8757.patch, LUCENE-8757.patch, 
> LUCENE-8757.patch, LUCENE-8757.patch, LUCENE-8757.patch, LUCENE-8757.patch
>
>
> The current segments to threads allocation algorithm always allocates one 
> thread per segment. This is detrimental to performance in case of skew in 
> segment sizes since small segments also get their dedicated thread. This can 
> lead to performance degradation due to context switching overheads.
>  
> A better algorithm which is cognizant of size skew would have better 
> performance for realistic scenarios






[jira] [Commented] (LUCENE-8804) FieldType attribute map should not be modifiable after freeze

2019-05-20 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844003#comment-16844003
 ] 

Michael McCandless commented on LUCENE-8804:


+1 to the issue and patch, great catch, thanks [~vamshi]!

> FieldType attribute map should not be modifiable after freeze
> -
>
> Key: LUCENE-8804
> URL: https://issues.apache.org/jira/browse/LUCENE-8804
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 8.0
>Reporter: Vamshi Vijay Nakkirtha
>Priority: Minor
>  Labels: features, patch
> Attachments: LUCENE-8804.patch
>
>
> Today the FieldType attribute map is still modifiable even after freeze. For 
> all other properties of FieldType, we call "checkIfFrozen()" before updating 
> the property, but for the attribute map we do not seem to make such a check. 
>  
> [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.0.0/lucene/core/src/java/org/apache/lucene/document/FieldType.java#L363]
> We may need to add a check at the beginning of the function, similar to the 
> other property setters.
>  
>  
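The suggested fix can be sketched with a standalone toy class (this is not Lucene's actual {{FieldType}}; it just shows the "checkIfFrozen" pattern every mutator, including the attribute map, should follow):

```java
import java.util.HashMap;
import java.util.Map;

public class FrozenAttributes {
  private final Map<String, String> attributes = new HashMap<>();
  private boolean frozen = false;

  // Guard shared by every setter: refuse mutation once frozen.
  private void checkIfFrozen() {
    if (frozen) {
      throw new IllegalStateException("this FieldType is already frozen and cannot be changed");
    }
  }

  // The fix: guard the attribute map the same way the other setters are guarded.
  public String putAttribute(String key, String value) {
    checkIfFrozen();
    return attributes.put(key, value);
  }

  public void freeze() { frozen = true; }

  public String getAttribute(String key) { return attributes.get(key); }

  public static void main(String[] args) {
    FrozenAttributes ft = new FrozenAttributes();
    ft.putAttribute("k", "v");
    ft.freeze();
    try {
      ft.putAttribute("k2", "v2"); // must now fail
    } catch (IllegalStateException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```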



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8757) Better Segment To Thread Mapping Algorithm

2019-05-08 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835502#comment-16835502
 ] 

Michael McCandless commented on LUCENE-8757:


Are the work units tackled in order for each query?  I.e. is the queue a FIFO 
queue?  If so, the sorting can be useful since {{IndexSearcher}} would work 
first on the hardest/slowest work units, the "long poles" for the concurrent 
search?

> Better Segment To Thread Mapping Algorithm
> --
>
> Key: LUCENE-8757
> URL: https://issues.apache.org/jira/browse/LUCENE-8757
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Attachments: LUCENE-8757.patch, LUCENE-8757.patch, LUCENE-8757.patch
>
>
> The current segments to threads allocation algorithm always allocates one 
> thread per segment. This is detrimental to performance in case of skew in 
> segment sizes since small segments also get their dedicated thread. This can 
> lead to performance degradation due to context switching overheads.
>  
> A better algorithm which is cognizant of size skew would have better 
> performance for realistic scenarios






[jira] [Commented] (LUCENE-8785) TestIndexWriterDelete.testDeleteAllNoDeadlock failure

2019-05-08 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835500#comment-16835500
 ] 

Michael McCandless commented on LUCENE-8785:


Thank you [~simonw]!  Love how open-source works ;)  Lucene gets better.

> TestIndexWriterDelete.testDeleteAllNoDeadlock failure
> -
>
> Key: LUCENE-8785
> URL: https://issues.apache.org/jira/browse/LUCENE-8785
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 7.6
> Environment: OpenJDK 1.8.0_202
>Reporter: Michael McCandless
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 7.7.2, master (9.0), 8.2, 8.1.1
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I was running Lucene's core tests on an {{i3.16xlarge}} EC2 instance (64 
> cores), and hit this random yet spooky failure:
> {noformat}
>    [junit4]   2> NOTE: reproduce with: ant test  
> -Dtestcase=TestIndexWriterDelete -Dtests.method=testDeleteAllNoDeadLock 
> -Dtests.seed=952BE262BA547C1 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=ar-YE -Dtests.timezone=Europe/Lisbon -Dtests.as\
> serts=true -Dtests.file.encoding=US-ASCII
>    [junit4] ERROR   0.16s J3 | TestIndexWriterDelete.testDeleteAllNoDeadLock 
> <<<
>    [junit4]    > Throwable #1: 
> com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an 
> uncaught exception in thread: Thread[id=36, name=Thread-2, state=RUNNABLE, 
> group=TGRP-TestIndexWriterDelete]
>    [junit4]    >        at 
> __randomizedtesting.SeedInfo.seed([952BE262BA547C1:3A4B5138AB66FD97]:0)
>    [junit4]    > Caused by: java.lang.RuntimeException: 
> java.lang.IllegalArgumentException: field number 0 is already mapped to field 
> name "null", not "content"
>    [junit4]    >        at 
> __randomizedtesting.SeedInfo.seed([952BE262BA547C1]:0)
>    [junit4]    >        at 
> org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:332)
>    [junit4]    > Caused by: java.lang.IllegalArgumentException: field number 
> 0 is already mapped to field name "null", not "content"
>    [junit4]    >        at 
> org.apache.lucene.index.FieldInfos$FieldNumbers.verifyConsistent(FieldInfos.java:310)
>    [junit4]    >        at 
> org.apache.lucene.index.FieldInfos$Builder.getOrAdd(FieldInfos.java:415)
>    [junit4]    >        at 
> org.apache.lucene.index.DefaultIndexingChain.getOrAddField(DefaultIndexingChain.java:650)
>    [junit4]    >        at 
> org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:428)
>    [junit4]    >        at 
> org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
>    [junit4]    >        at 
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)
>    [junit4]    >        at 
> org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)
>    [junit4]    >        at 
> org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)
>    [junit4]    >        at 
> org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)
>    [junit4]    >        at 
> org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:159)
>    [junit4]    >        at 
> org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:326){noformat}
> It does *not* reproduce unfortunately ... but maybe there is some subtle 
> thread safety issue in this code ... this is a hairy part of Lucene ;)






[jira] [Commented] (LUCENE-8757) Better Segment To Thread Mapping Algorithm

2019-05-08 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835498#comment-16835498
 ] 

Michael McCandless commented on LUCENE-8757:


Whoa, fast iterations over here!

I think there is an important justification for the 2nd criterion (number of 
segments in each work unit / slice): if you have an index with a few large 
segments plus a long tail of small segments (which easily happens if your 
machine has substantial CPU concurrency and you use multiple indexing threads), 
there is a fixed cost for visiting each segment, so if you put too many small 
segments into one work unit, those fixed costs multiply and that work unit can 
become too slow even though it does not actually visit many documents.

I think we should keep it?

Re: the choice of the constants – I ran some performance tests quite a while 
ago on our production data/queries and a machine with sizable concurrency 
({{i3.16xlarge}}) and found those two constants to be a sweet spot at the time.

But let's also remember: this is simply a default segment -> work units 
assignment, and expert users can always continue to override.  Good defaults 
are important ;)
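The greedy size-aware assignment described above can be sketched standalone (segment doc counts and the cap values below are made up for illustration; this is not Lucene's actual slicing code or its real default constants):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SegmentSlicer {
  // Packs segments (sorted by doc count, descending) into slices, closing a
  // slice once it holds enough docs OR enough segments -- the 2nd criterion
  // keeps a long tail of tiny segments from piling into one slow work unit.
  static List<List<Integer>> slice(int[] docCounts, int maxDocsPerSlice, int maxSegmentsPerSlice) {
    Integer[] sorted = Arrays.stream(docCounts).boxed()
        .sorted((a, b) -> Integer.compare(b, a)).toArray(Integer[]::new);
    List<List<Integer>> slices = new ArrayList<>();
    List<Integer> current = new ArrayList<>();
    long docs = 0;
    for (int count : sorted) {
      if (!current.isEmpty()
          && (docs + count > maxDocsPerSlice || current.size() >= maxSegmentsPerSlice)) {
        slices.add(current);
        current = new ArrayList<>();
        docs = 0;
      }
      current.add(count);
      docs += count;
    }
    if (!current.isEmpty()) slices.add(current);
    return slices;
  }

  public static void main(String[] args) {
    // One big segment plus a long tail: the big one gets its own slice, and
    // the tail is grouped instead of each tiny segment getting its own thread.
    int[] segments = {900_000, 40_000, 30_000, 20_000, 10_000, 5_000, 2_000};
    System.out.println(slice(segments, 250_000, 3));
  }
}
```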

> Better Segment To Thread Mapping Algorithm
> --
>
> Key: LUCENE-8757
> URL: https://issues.apache.org/jira/browse/LUCENE-8757
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Attachments: LUCENE-8757.patch, LUCENE-8757.patch, LUCENE-8757.patch
>
>
> The current segments to threads allocation algorithm always allocates one 
> thread per segment. This is detrimental to performance in case of skew in 
> segment sizes since small segments also get their dedicated thread. This can 
> lead to performance degradation due to context switching overheads.
>  
> A better algorithm which is cognizant of size skew would have better 
> performance for realistic scenarios






[jira] [Commented] (LUCENE-8791) Add CollectorRescorer

2019-05-03 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832327#comment-16832327
 ] 

Michael McCandless commented on LUCENE-8791:


{quote}Please have a look again at the spacing. In general, it would be good if 
the code was a bit more readable w.r.t. spacing around braces, breaking the code 
into logical paragraphs.
{quote}
The spacing/indenting looks correct to me – it seems to match Lucene's coding 
guidelines ([https://wiki.apache.org/lucene-java/DeveloperTips]).

 

> Add CollectorRescorer
> -
>
> Key: LUCENE-8791
> URL: https://issues.apache.org/jira/browse/LUCENE-8791
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Elbek Kamoliddinov
>Priority: Major
> Attachments: LUCENE-8791.patch
>
>
> This is another implementation of query rescorer api (LUCENE-5489). It adds 
> rescoring functionality based on provided CollectorManager. 
>  






[jira] [Commented] (LUCENE-8756) MLT queries ignore custom term frequencies

2019-05-02 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832139#comment-16832139
 ] 

Michael McCandless commented on LUCENE-8756:


Ugh, sorry!  Thank you [~cpoerschke]!

> MLT queries ignore custom term frequencies
> --
>
> Key: LUCENE-8756
> URL: https://issues.apache.org/jira/browse/LUCENE-8756
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser
>Affects Versions: 7.0, 7.0.1, 7.1, 7.2, 7.2.1, 7.3, 7.4, 7.3.1, 7.5, 7.6, 
> 7.7, 7.7.1, 8.0
>Reporter: Olli Kuonanoja
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The MLT queries ignore any custom term frequencies for the like-texts and 
> uses a hard-coded frequency of 1 per occurrence. I have prepared a test-case 
> to demonstrate the issue and a fix proposal 
> https://github.com/ollik1/lucene-solr/commit/9dbbce2af26698cec1ac82a526d9cee60a880678
>  






[jira] [Created] (LUCENE-8790) Spooky exception merging doc values

2019-05-02 Thread Michael McCandless (JIRA)
Michael McCandless created LUCENE-8790:
--

 Summary: Spooky exception merging doc values
 Key: LUCENE-8790
 URL: https://issues.apache.org/jira/browse/LUCENE-8790
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/index
Affects Versions: 7.5
 Environment: We are on a Lucene 7.x snapshot, githash 
935b0c89c6ecb446d7f05d938207760cd64bcd04, using the default Codec, with a 
static sort.
Reporter: Michael McCandless


We hit this exciting exception; we don't have a test case reproducing it, and 
staring at the code, I don't see how we can hit a {{NullPointerException}} on 
this line:
{noformat}
[May 2, 2019, 7:24 PM] Barrowman, Adam: 2019-05-02T18:32:10,561 [ERROR] (Lucene 
Merge Thread #1) com.amazon.lucene.util.UncaughtExceptionHandler: Uncaught 
exception: org.apache.lucene.index.MergePolicy$MergeException: 
java.lang.NullPointerException in thread Thread[Lucene Merge Thread #1,5,main] 
org.apache.lucene.index.MergePolicy$MergeException: 
java.lang.NullPointerException
at 
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:704)
at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:684)
Caused by: java.lang.NullPointerException
at 
org.apache.lucene.codecs.lucene70.Lucene70DocValuesConsumer.writeValuesSingleBlock(Lucene70DocValuesConsumer.java:279)
at 
org.apache.lucene.codecs.lucene70.Lucene70DocValuesConsumer.writeValues(Lucene70DocValuesConsumer.java:263)
at 
org.apache.lucene.codecs.lucene70.Lucene70DocValuesConsumer.addSortedNumericField(Lucene70DocValuesConsumer.java:536)
at 
org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:371)
at org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:143)
at 
org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:151)
at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:182)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:126)
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4438)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4060)
at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625)
at 
com.amazon.lucene.index.ConcurrentMergeSchedulerWrapper.doMerge(ConcurrentMergeSchedulerWrapper.java:54)
at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662)
{noformat}
It seems like the {{encode.get(v)}} somehow returned null, which should not 
happen as long as the values we iterated from the {{SortedNumericValues}} were 
the same up above (in {{writeValues}}) and in {{writeValuesSingleBlock}}.  
Confused...

Note that we are using a 7.x snapshot, so it is possible this was a bug in 7.x 
at that time that was fixed before the next 7.x release, though when I compare 
the affected code against the 8.x backwards codec, it looks the same.

 






[jira] [Commented] (LUCENE-8785) TestIndexWriterDelete.testDeleteAllNoDeadlock failure

2019-05-02 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832036#comment-16832036
 ] 

Michael McCandless commented on LUCENE-8785:


{quote}If there is another thread coming in after we locked the existent 
threadstates we just issue a new one.
{quote}
Yuck :(
{quote}I think we can just do what deleteAll() does today except of not 
dropping the schema on the floor?
{quote}
The thing is, I think erasing schema while under transaction is a useful 
feature of Lucene.  I realize neither ES nor Solr expose deleteAll but I don't 
think that's a valid argument to remove it from Lucene ;)
{quote}I want to understand the use case for this. I can see how somebody wants 
to drop all docs, but basically dropping all IW state on the floor is difficult 
in my eyes.
{quote}
Well, imagine a user searching documents with diverse/varying fields, maybe 
arriving from an external (not controlled by the developer) source.  And for 
some reason the index is reset once per week, but the devs want to allow 
searching of the old index while the new index is (slowly) built up.  But if 
something goes badly wrong, they need to be able to rollback (the {{deleteAll}} 
and all subsequently added docs) to the last commit and try again later.  If 
instead it succeeds, then a refresh/commit will switch to the new index 
atomically.

> TestIndexWriterDelete.testDeleteAllNoDeadlock failure
> -
>
> Key: LUCENE-8785
> URL: https://issues.apache.org/jira/browse/LUCENE-8785
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 7.6
> Environment: OpenJDK 1.8.0_202
>Reporter: Michael McCandless
>Priority: Minor
>
> I was running Lucene's core tests on an {{i3.16xlarge}} EC2 instance (64 
> cores), and hit this random yet spooky failure:
> {noformat}
>    [junit4]   2> NOTE: reproduce with: ant test  
> -Dtestcase=TestIndexWriterDelete -Dtests.method=testDeleteAllNoDeadLock 
> -Dtests.seed=952BE262BA547C1 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=ar-YE -Dtests.timezone=Europe/Lisbon -Dtests.as\
> serts=true -Dtests.file.encoding=US-ASCII
>    [junit4] ERROR   0.16s J3 | TestIndexWriterDelete.testDeleteAllNoDeadLock 
> <<<
>    [junit4]    > Throwable #1: 
> com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an 
> uncaught exception in thread: Thread[id=36, name=Thread-2, state=RUNNABLE, 
> group=TGRP-TestIndexWriterDelete]
>    [junit4]    >        at 
> __randomizedtesting.SeedInfo.seed([952BE262BA547C1:3A4B5138AB66FD97]:0)
>    [junit4]    > Caused by: java.lang.RuntimeException: 
> java.lang.IllegalArgumentException: field number 0 is already mapped to field 
> name "null", not "content"
>    [junit4]    >        at 
> __randomizedtesting.SeedInfo.seed([952BE262BA547C1]:0)
>    [junit4]    >        at 
> org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:332)
>    [junit4]    > Caused by: java.lang.IllegalArgumentException: field number 
> 0 is already mapped to field name "null", not "content"
>    [junit4]    >        at 
> org.apache.lucene.index.FieldInfos$FieldNumbers.verifyConsistent(FieldInfos.java:310)
>    [junit4]    >        at 
> org.apache.lucene.index.FieldInfos$Builder.getOrAdd(FieldInfos.java:415)
>    [junit4]    >        at 
> org.apache.lucene.index.DefaultIndexingChain.getOrAddField(DefaultIndexingChain.java:650)
>    [junit4]    >        at 
> org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:428)
>    [junit4]    >        at 
> org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
>    [junit4]    >        at 
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)
>    [junit4]    >        at 
> org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)
>    [junit4]    >        at 
> org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)
>    [junit4]    >        at 
> org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)
>    [junit4]    >        at 
> org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:159)
>    [junit4]    >        at 
> org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:326){noformat}
> It does *not* reproduce unfortunately ... but maybe there is some subtle 
> thread safety issue in this code ... this is a hairy part of Lucene ;)






[jira] [Commented] (LUCENE-6687) MLT term frequency calculation bug

2019-05-02 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832027#comment-16832027
 ] 

Michael McCandless commented on LUCENE-6687:


OK thanks [~teofili] – I'll backport this soon.

> MLT term frequency calculation bug
> --
>
> Key: LUCENE-6687
> URL: https://issues.apache.org/jira/browse/LUCENE-6687
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/query/scoring, core/queryparser
>Affects Versions: 5.2.1, 6.0
> Environment: OS X v10.10.4; Solr 5.2.1
>Reporter: Marko Bonaci
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: 5.2.2, master (9.0)
>
> Attachments: LUCENE-6687.patch, LUCENE-6687.patch, LUCENE-6687.patch, 
> LUCENE-6687.patch, buggy-method-usage.png, 
> solr-mlt-tf-doubling-bug-results.png, 
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png, 
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png, 
> solr-mlt-tf-doubling-bug.png, terms-accumulator.png, terms-angry.png, 
> terms-glass.png, terms-how.png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In {{org.apache.lucene.queries.mlt.MoreLikeThis}}, there's a method 
> {{retrieveTerms}} that receives a {{Map}} of fields, i.e. a document 
> basically, but it doesn't have to be an existing doc.
> !solr-mlt-tf-doubling-bug.png|height=500!
> There are 2 for loops, one inside the other, which both loop through the same 
> set of fields.
> That effectively doubles the term frequency for all the terms from fields 
> that we provide in MLT QP {{qf}} parameter. 
> It basically goes two times over the list of fields and accumulates the term 
> frequencies from all fields into {{termFreqMap}}.
> The private method {{retrieveTerms}} is only called from one public method: 
> the overloaded version of {{like}} that receives a Map. So the private class 
> member {{fieldNames}} is always derived from {{retrieveTerms}}'s argument 
> {{fields}}.
>  
> In other words, by the time the {{retrieveTerms}} method gets called, its 
> parameter {{fields}} and the private member {{fieldNames}} always contain the 
> same list of fields.
> Here's the proof:
> These are the final results of the calculation:
> !solr-mlt-tf-doubling-bug-results.png|height=700!
> And this is the actual {{thread_id:TID0009}} document, where those values 
> were derived from (from fields {{title_mlt}} and {{pagetext_mlt}}):
> !terms-glass.png|height=100!
> !terms-angry.png|height=100!
> !terms-how.png|height=100!
> !terms-accumulator.png|height=100!
> Now, let's further test this hypothesis by seeing MLT QP in action from the 
> AdminUI.
> Let's try to find docs that are More Like doc {{TID0009}}. 
> Here's the interesting part, the query:
> {code}
> q={!mlt qf=pagetext_mlt,title_mlt mintf=14 mindf=2 minwl=3 maxwl=15}TID0009
> {code}
> We just saw, in the last image above, that the term accumulator appears {{7}} 
> times in {{TID0009}} doc, but the {{accumulator}}'s TF was calculated as 
> {{14}}.
> By using {{mintf=14}}, we say that, when calculating similarity, we don't 
> want to consider terms that appear less than 14 times (when terms from fields 
> {{title_mlt}} and {{pagetext_mlt}} are merged together) in {{TID0009}}.
> I added the term accumulator in only one other document ({{TID0004}}), where 
> it appears only once, in the field {{title_mlt}}. 
> !solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png|height=500!
> Let's see what happens when we use {{mintf=15}}:
> !solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png|height=500!
> I should probably mention that multiple fields ({{qf}}) work because I 
> applied the patch: 
> [SOLR-7143|https://issues.apache.org/jira/browse/SOLR-7143].
> Bug, no?
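The double-counting pattern described above can be sketched in plain Java. This is a hypothetical illustration, not the actual {{MoreLikeThis}} code: an outer loop over the field names and an inner loop over the same field set accumulate each term's frequency once per field, so with two {{qf}} fields every count doubles (7 becomes 14, matching the {{mintf}} behavior shown in the screenshots).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TfDoublingSketch {

  // Buggy variant: nested loops over the SAME field set, so every term
  // is accumulated once per field in the map (2 fields => doubled TF).
  static Map<String, Integer> buggyTermFreqs(Map<String, List<String>> fields) {
    Map<String, Integer> termFreqMap = new HashMap<>();
    for (String ignored : fields.keySet()) {      // outer pass over field names
      for (String field : fields.keySet()) {      // inner pass over the same names
        for (String term : fields.get(field)) {
          termFreqMap.merge(term, 1, Integer::sum);
        }
      }
    }
    return termFreqMap;
  }

  // Fixed variant: a single pass over the fields.
  static Map<String, Integer> fixedTermFreqs(Map<String, List<String>> fields) {
    Map<String, Integer> termFreqMap = new HashMap<>();
    for (String field : fields.keySet()) {
      for (String term : fields.get(field)) {
        termFreqMap.merge(term, 1, Integer::sum);
      }
    }
    return termFreqMap;
  }

  public static void main(String[] args) {
    Map<String, List<String>> doc = new LinkedHashMap<>();
    doc.put("title_mlt", Arrays.asList("accumulator"));
    doc.put("pagetext_mlt", Arrays.asList(
        "accumulator", "accumulator", "accumulator",
        "accumulator", "accumulator", "accumulator"));
    // "accumulator" occurs 7 times across both fields, but the buggy
    // version reports 14.
    System.out.println("buggy: " + buggyTermFreqs(doc).get("accumulator"));
    System.out.println("fixed: " + fixedTermFreqs(doc).get("accumulator"));
  }
}
```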



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8785) TestIndexWriterDelete.testDeleteAllNoDeadlock failure

2019-05-02 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831618#comment-16831618
 ] 

Michael McCandless commented on LUCENE-8785:


But at the point we call {{clear()}} haven't we already blocked all indexing 
threads?

I also dislike {{deleteAll()}}, and you're right that a user could deleteByQuery 
using MatchAllDocsQuery; can we make that nearly as efficient as 
{{deleteAll()}} is today?  Though indeed that would preserve the schema, while 
{{deleteAll()}} lets you delete the docs and the schema, all under a transaction 
(the change is not visible until commit).  I'm torn on just removing that ...

> TestIndexWriterDelete.testDeleteAllNoDeadlock failure
> -
>
> Key: LUCENE-8785
> URL: https://issues.apache.org/jira/browse/LUCENE-8785
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 7.6
> Environment: OpenJDK 1.8.0_202
>Reporter: Michael McCandless
>Priority: Minor
>
> I was running Lucene's core tests on an {{i3.16xlarge}} EC2 instance (64 
> cores), and hit this random yet spooky failure:
> {noformat}
>    [junit4]   2> NOTE: reproduce with: ant test  
> -Dtestcase=TestIndexWriterDelete -Dtests.method=testDeleteAllNoDeadLock 
> -Dtests.seed=952BE262BA547C1 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=ar-YE -Dtests.timezone=Europe/Lisbon 
> -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
>    [junit4] ERROR   0.16s J3 | TestIndexWriterDelete.testDeleteAllNoDeadLock 
> <<<
>    [junit4]    > Throwable #1: 
> com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an 
> uncaught exception in thread: Thread[id=36, name=Thread-2, state=RUNNABLE, 
> group=TGRP-TestIndexWriterDelete]
>    [junit4]    >        at 
> __randomizedtesting.SeedInfo.seed([952BE262BA547C1:3A4B5138AB66FD97]:0)
>    [junit4]    > Caused by: java.lang.RuntimeException: 
> java.lang.IllegalArgumentException: field number 0 is already mapped to field 
> name "null", not "content"
>    [junit4]    >        at 
> __randomizedtesting.SeedInfo.seed([952BE262BA547C1]:0)
>    [junit4]    >        at 
> org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:332)
>    [junit4]    > Caused by: java.lang.IllegalArgumentException: field number 
> 0 is already mapped to field name "null", not "content"
>    [junit4]    >        at 
> org.apache.lucene.index.FieldInfos$FieldNumbers.verifyConsistent(FieldInfos.java:310)
>    [junit4]    >        at 
> org.apache.lucene.index.FieldInfos$Builder.getOrAdd(FieldInfos.java:415)
>    [junit4]    >        at 
> org.apache.lucene.index.DefaultIndexingChain.getOrAddField(DefaultIndexingChain.java:650)
>    [junit4]    >        at 
> org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:428)
>    [junit4]    >        at 
> org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
>    [junit4]    >        at 
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)
>    [junit4]    >        at 
> org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)
>    [junit4]    >        at 
> org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)
>    [junit4]    >        at 
> org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)
>    [junit4]    >        at 
> org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:159)
>    [junit4]    >        at 
> org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:326){noformat}
> It does *not* reproduce unfortunately ... but maybe there is some subtle 
> thread safety issue in this code ... this is a hairy part of Lucene ;)






[jira] [Updated] (LUCENE-8785) TestIndexWriterDelete.testDeleteAllNoDeadlock failure

2019-04-30 Thread Michael McCandless (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-8785:
---
Environment: OpenJDK 1.8.0_202  (was: OpenJDK 11)

> TestIndexWriterDelete.testDeleteAllNoDeadlock failure
> -
>
> Key: LUCENE-8785
> URL: https://issues.apache.org/jira/browse/LUCENE-8785
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 7.6
> Environment: OpenJDK 1.8.0_202
>Reporter: Michael McCandless
>Priority: Minor
>
> I was running Lucene's core tests on an {{i3.16xlarge}} EC2 instance (64 
> cores), and hit this random yet spooky failure:
> {noformat}
>    [junit4]   2> NOTE: reproduce with: ant test  
> -Dtestcase=TestIndexWriterDelete -Dtests.method=testDeleteAllNoDeadLock 
> -Dtests.seed=952BE262BA547C1 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=ar-YE -Dtests.timezone=Europe/Lisbon 
> -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
>    [junit4] ERROR   0.16s J3 | TestIndexWriterDelete.testDeleteAllNoDeadLock 
> <<<
>    [junit4]    > Throwable #1: 
> com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an 
> uncaught exception in thread: Thread[id=36, name=Thread-2, state=RUNNABLE, 
> group=TGRP-TestIndexWriterDelete]
>    [junit4]    >        at 
> __randomizedtesting.SeedInfo.seed([952BE262BA547C1:3A4B5138AB66FD97]:0)
>    [junit4]    > Caused by: java.lang.RuntimeException: 
> java.lang.IllegalArgumentException: field number 0 is already mapped to field 
> name "null", not "content"
>    [junit4]    >        at 
> __randomizedtesting.SeedInfo.seed([952BE262BA547C1]:0)
>    [junit4]    >        at 
> org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:332)
>    [junit4]    > Caused by: java.lang.IllegalArgumentException: field number 
> 0 is already mapped to field name "null", not "content"
>    [junit4]    >        at 
> org.apache.lucene.index.FieldInfos$FieldNumbers.verifyConsistent(FieldInfos.java:310)
>    [junit4]    >        at 
> org.apache.lucene.index.FieldInfos$Builder.getOrAdd(FieldInfos.java:415)
>    [junit4]    >        at 
> org.apache.lucene.index.DefaultIndexingChain.getOrAddField(DefaultIndexingChain.java:650)
>    [junit4]    >        at 
> org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:428)
>    [junit4]    >        at 
> org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
>    [junit4]    >        at 
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)
>    [junit4]    >        at 
> org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)
>    [junit4]    >        at 
> org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)
>    [junit4]    >        at 
> org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)
>    [junit4]    >        at 
> org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:159)
>    [junit4]    >        at 
> org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:326){noformat}
> It does *not* reproduce unfortunately ... but maybe there is some subtle 
> thread safety issue in this code ... this is a hairy part of Lucene ;)






[jira] [Created] (LUCENE-8785) TestIndexWriterDelete.testDeleteAllNoDeadlock failure

2019-04-30 Thread Michael McCandless (JIRA)
Michael McCandless created LUCENE-8785:
--

 Summary: TestIndexWriterDelete.testDeleteAllNoDeadlock failure
 Key: LUCENE-8785
 URL: https://issues.apache.org/jira/browse/LUCENE-8785
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/index
Affects Versions: 7.6
 Environment: OpenJDK 11
Reporter: Michael McCandless


I was running Lucene's core tests on an {{i3.16xlarge}} EC2 instance (64 
cores), and hit this random yet spooky failure:
{noformat}
   [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestIndexWriterDelete -Dtests.method=testDeleteAllNoDeadLock 
-Dtests.seed=952BE262BA547C1 -Dtests.slow=true -Dtests.badapples=true 
-Dtests.locale=ar-YE -Dtests.timezone=Europe/Lisbon -Dtests.asserts=true 
-Dtests.file.encoding=US-ASCII

   [junit4] ERROR   0.16s J3 | TestIndexWriterDelete.testDeleteAllNoDeadLock <<<

   [junit4]    > Throwable #1: 
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught 
exception in thread: Thread[id=36, name=Thread-2, state=RUNNABLE, 
group=TGRP-TestIndexWriterDelete]

   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([952BE262BA547C1:3A4B5138AB66FD97]:0)

   [junit4]    > Caused by: java.lang.RuntimeException: 
java.lang.IllegalArgumentException: field number 0 is already mapped to field 
name "null", not "content"

   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([952BE262BA547C1]:0)

   [junit4]    >        at 
org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:332)

   [junit4]    > Caused by: java.lang.IllegalArgumentException: field number 0 
is already mapped to field name "null", not "content"

   [junit4]    >        at 
org.apache.lucene.index.FieldInfos$FieldNumbers.verifyConsistent(FieldInfos.java:310)

   [junit4]    >        at 
org.apache.lucene.index.FieldInfos$Builder.getOrAdd(FieldInfos.java:415)

   [junit4]    >        at 
org.apache.lucene.index.DefaultIndexingChain.getOrAddField(DefaultIndexingChain.java:650)

   [junit4]    >        at 
org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:428)

   [junit4]    >        at 
org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)

   [junit4]    >        at 
org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)

   [junit4]    >        at 
org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)

   [junit4]    >        at 
org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)

   [junit4]    >        at 
org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)

   [junit4]    >        at 
org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:159)

   [junit4]    >        at 
org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:326){noformat}
It does *not* reproduce unfortunately ... but maybe there is some subtle thread 
safety issue in this code ... this is a hairy part of Lucene ;)






[jira] [Commented] (LUCENE-6687) MLT term frequency calculation bug

2019-04-30 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830448#comment-16830448
 ] 

Michael McCandless commented on LUCENE-6687:


Hmm it looks like this change was not backported to 8.x – was that intentional? 
 I'm having trouble backporting LUCENE-8756 because of this ... if it was 
unintentional, I'll just backport this change first.

Why do we show Fix Version 5.2.2?  Was it really backported to 5.2.x branch?

> MLT term frequency calculation bug
> --
>
> Key: LUCENE-6687
> URL: https://issues.apache.org/jira/browse/LUCENE-6687
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/query/scoring, core/queryparser
>Affects Versions: 5.2.1, 6.0
> Environment: OS X v10.10.4; Solr 5.2.1
>Reporter: Marko Bonaci
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: 5.2.2, master (9.0)
>
> Attachments: LUCENE-6687.patch, LUCENE-6687.patch, LUCENE-6687.patch, 
> LUCENE-6687.patch, buggy-method-usage.png, 
> solr-mlt-tf-doubling-bug-results.png, 
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png, 
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png, 
> solr-mlt-tf-doubling-bug.png, terms-accumulator.png, terms-angry.png, 
> terms-glass.png, terms-how.png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In {{org.apache.lucene.queries.mlt.MoreLikeThis}}, there's a method 
> {{retrieveTerms}} that receives a {{Map}} of fields, i.e. essentially a 
> document, though it doesn't have to be an existing doc.
> !solr-mlt-tf-doubling-bug.png|height=500!
> There are 2 for loops, one inside the other, which both loop through the same 
> set of fields.
> That effectively doubles the term frequency for all the terms from the fields 
> that we provide in the MLT QP {{qf}} parameter. 
> It basically goes two times over the list of fields and accumulates the term 
> frequencies from all fields into {{termFreqMap}}.
> The private method {{retrieveTerms}} is only called from one public method, 
> the version of overloaded method {{like}} that receives a Map: so that 
> private class member {{fieldNames}} is always derived from 
> {{retrieveTerms}}'s argument {{fields}}.
>  
> In other words, by the time the {{retrieveTerms}} method gets called, its 
> parameter {{fields}} and the private member {{fieldNames}} always contain the 
> same list of fields.
> Here's the proof:
> These are the final results of the calculation:
> !solr-mlt-tf-doubling-bug-results.png|height=700!
> And this is the actual {{thread_id:TID0009}} document, where those values 
> were derived from (from fields {{title_mlt}} and {{pagetext_mlt}}):
> !terms-glass.png|height=100!
> !terms-angry.png|height=100!
> !terms-how.png|height=100!
> !terms-accumulator.png|height=100!
> Now, let's further test this hypothesis by seeing MLT QP in action from the 
> AdminUI.
> Let's try to find docs that are More Like doc {{TID0009}}. 
> Here's the interesting part, the query:
> {code}
> q={!mlt qf=pagetext_mlt,title_mlt mintf=14 mindf=2 minwl=3 maxwl=15}TID0009
> {code}
> We just saw, in the last image above, that the term accumulator appears {{7}} 
> times in {{TID0009}} doc, but the {{accumulator}}'s TF was calculated as 
> {{14}}.
> By using {{mintf=14}}, we say that, when calculating similarity, we don't 
> want to consider terms that appear less than 14 times (when terms from fields 
> {{title_mlt}} and {{pagetext_mlt}} are merged together) in {{TID0009}}.
> I added the term accumulator in only one other document ({{TID0004}}), where 
> it appears only once, in the field {{title_mlt}}. 
> !solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png|height=500!
> Let's see what happens when we use {{mintf=15}}:
> !solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png|height=500!
> I should probably mention that multiple fields ({{qf}}) work because I 
> applied the patch: 
> [SOLR-7143|https://issues.apache.org/jira/browse/SOLR-7143].
> Bug, no?






[jira] [Commented] (LUCENE-8756) MLT queries ignore custom term frequencies

2019-04-30 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830206#comment-16830206
 ] 

Michael McCandless commented on LUCENE-8756:


Great, thanks [~ollik1] – I'll push soon.

> MLT queries ignore custom term frequencies
> --
>
> Key: LUCENE-8756
> URL: https://issues.apache.org/jira/browse/LUCENE-8756
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser
>Affects Versions: 7.0, 7.0.1, 7.1, 7.2, 7.2.1, 7.3, 7.4, 7.3.1, 7.5, 7.6, 
> 7.7, 7.7.1, 8.0
>Reporter: Olli Kuonanoja
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The MLT queries ignore any custom term frequencies for the like-texts and 
> uses a hard-coded frequency of 1 per occurrence. I have prepared a test-case 
> to demonstrate the issue and a fix proposal 
> https://github.com/ollik1/lucene-solr/commit/9dbbce2af26698cec1ac82a526d9cee60a880678
>  






[jira] [Commented] (LUCENE-8708) Can we simplify conjunctions of range queries automatically?

2019-04-30 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830184#comment-16830184
 ] 

Michael McCandless commented on LUCENE-8708:


Hmm why do we need the {{PointRangeQuery.ToStringInterface}}?  Also, why did we 
need to comment on that one test case – {{testInvalidPointLength}}?

> Can we simplify conjunctions of range queries automatically?
> 
>
> Key: LUCENE-8708
> URL: https://issues.apache.org/jira/browse/LUCENE-8708
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: interval_range_clauses_merging0704.patch
>
>
> BooleanQuery#rewrite already has some logic to make queries more efficient, 
> such as deduplicating filters or rewriting boolean queries that wrap a single 
> positive clause to that clause.
> It would be nice to also simplify conjunctions of range queries, so that eg. 
> {{foo: [5 TO *] AND foo:[* TO 20]}} would be rewritten to {{foo:[5 TO 20]}}. 
> When constructing queries manually or via the classic query parser, it feels 
> unnecessary as this is something that the user can fix easily. However if you 
> want to implement a query parser that only allows specifying one bound at 
> once, such as Gmail ({{after:2018-12-31}} 
> https://support.google.com/mail/answer/7190?hl=en) or GitHub 
> ({{updated:>=2018-12-31}} 
> https://help.github.com/en/articles/searching-issues-and-pull-requests#search-by-when-an-issue-or-pull-request-was-created-or-last-updated)
>  then you might end up with inefficient queries if the end user specifies 
> both an upper and a lower bound. It would be nice if we optimized those 
> automatically.






[jira] [Commented] (LUCENE-8757) Better Segment To Thread Mapping Algorithm

2019-04-30 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830173#comment-16830173
 ] 

Michael McCandless commented on LUCENE-8757:


Thanks [~atris] – I agree it's important to have better defaults for how we 
coalesce segments into per-query-per-thread work units.  A few small comments:
 * Can you insert {{_}} in the big number constants (e.g. {{2500}})?  Makes 
it easier to read, and open-source code is written for reading :)
 * I think something is wrong with {{docSum}} – you only set it, and never add 
to it?  I think the intention is to sum up docs in multiple adjacent (sorted by 
{{maxDoc}}) segments until that count exceeds {{2500}}?
 * How did you pick {{2500}} and {{100}} as good constants?  We are using 
much smaller values in our production infrastructure – {{250_000}} and {{5}}, 
admittedly after only a little experimentation. 
 * Can you add some tests?  You can maybe make the slice method a package 
private static method and then create test cases with "interesting" 
{{LeafReaderContext}} combinations?  In particular, a test case exposing the 
{{docSum}} bug would be great, then fix that bug, then see the test case pass.
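The intended accumulation can be sketched as follows. This is an illustrative reconstruction of the idea under review, not the patch itself, and the thresholds and names ({{docLimit}}, {{segLimit}}) are made up for the example: sort segments by size, then group adjacent segments into one slice until the running {{docSum}} crosses a document threshold or the slice holds too many segments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SliceSketch {

  // Group segment sizes (maxDoc counts) into slices (per-thread work units).
  static List<List<Integer>> slice(List<Integer> maxDocs, int docLimit, int segLimit) {
    List<Integer> sorted = new ArrayList<>(maxDocs);
    sorted.sort(Collections.reverseOrder());          // largest segments first
    List<List<Integer>> slices = new ArrayList<>();
    List<Integer> current = new ArrayList<>();
    int docSum = 0;
    for (int maxDoc : sorted) {
      current.add(maxDoc);
      docSum += maxDoc;                               // accumulate, don't overwrite
      if (docSum >= docLimit || current.size() >= segLimit) {
        slices.add(current);                          // close this slice
        current = new ArrayList<>();
        docSum = 0;
      }
    }
    if (!current.isEmpty()) {
      slices.add(current);                            // leftover small segments
    }
    return slices;
  }

  public static void main(String[] args) {
    // One big segment gets its own slice; the small ones share a slice,
    // avoiding a dedicated thread per tiny segment.
    System.out.println(slice(Arrays.asList(100, 50, 2000, 10), 1000, 5));
  }
}
```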

> Better Segment To Thread Mapping Algorithm
> --
>
> Key: LUCENE-8757
> URL: https://issues.apache.org/jira/browse/LUCENE-8757
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Attachments: LUCENE-8757.patch
>
>
> The current segments to threads allocation algorithm always allocates one 
> thread per segment. This is detrimental to performance in case of skew in 
> segment sizes since small segments also get their dedicated thread. This can 
> lead to performance degradation due to context switching overheads.
>  
> A better algorithm which is cognizant of size skew would have better 
> performance for realistic scenarios






[jira] [Commented] (LUCENE-8756) MLT queries ignore custom term frequencies

2019-04-30 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830139#comment-16830139
 ] 

Michael McCandless commented on LUCENE-8756:


The change looks good – I left a couple minor comments – kinda freaky how Jira 
now tracks and posts how long I spend looking at a GitHub PR ;)  Thanks 
[~ollik1].

> MLT queries ignore custom term frequencies
> --
>
> Key: LUCENE-8756
> URL: https://issues.apache.org/jira/browse/LUCENE-8756
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser
>Affects Versions: 7.0, 7.0.1, 7.1, 7.2, 7.2.1, 7.3, 7.4, 7.3.1, 7.5, 7.6, 
> 7.7, 7.7.1, 8.0
>Reporter: Olli Kuonanoja
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The MLT queries ignore any custom term frequencies for the like-texts and 
> uses a hard-coded frequency of 1 per occurrence. I have prepared a test-case 
> to demonstrate the issue and a fix proposal 
> https://github.com/ollik1/lucene-solr/commit/9dbbce2af26698cec1ac82a526d9cee60a880678
>  






[jira] [Commented] (LUCENE-8783) Add FST Offheap for non-default Codecs

2019-04-30 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830070#comment-16830070
 ] 

Michael McCandless commented on LUCENE-8783:


+1

> Add FST Offheap for non-default Codecs
> --
>
> Key: LUCENE-8783
> URL: https://issues.apache.org/jira/browse/LUCENE-8783
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/FSTs
>Reporter: Ankit Jain
>Priority: Major
> Fix For: 8.0, 8.x, master (9.0)
>
>
> Even though, LUCENE-8635 and LUCENE-8671 adds support to keep FST offheap for 
> default codec, there are many other codecs which do not support FST offheap. 
> Few examples are below:
> * CompletionPostingsFormat
> * BlockTreeOrdsPostingsFormat
> * IDVersionPostingsFormat






[jira] [Commented] (LUCENE-8776) Start offset going backwards has a legitimate purpose

2019-04-29 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829561#comment-16829561
 ] 

Michael McCandless commented on LUCENE-8776:


[~venkat11] I'm sorry this change broke your use case.

I think allowing backwards offsets was an accidental but longstanding bug in 
prior versions of Lucene.  It is unfortunate your code came to rely on that 
bug, but we need to be able to fix our bugs and move forwards.

[~mgibney] a 3rd option in your list would be for [~venkat11] to fix his query 
parser to properly consume the graph, and generate fully accurate queries, the 
way Lucene's query parsers now do.  Then you can have precisely matching 
queries, no bugs.

> Start offset going backwards has a legitimate purpose
> -
>
> Key: LUCENE-8776
> URL: https://issues.apache.org/jira/browse/LUCENE-8776
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 7.6
>Reporter: Ram Venkat
>Priority: Major
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run 
> span queries and highlight them properly. 
> During index time, light-emitting-diode is split into three words, which 
> allows me to search for 'light', 'emitting' and 'diode' individually. The 
> three words occupy adjacent positions in the index, as 'light' adjacent to 
> 'emitting' and 'light' at a distance of two words from 'diode' need to match 
> this word. So, the order of words after splitting are: Organic, light, 
> emitting, diode, glows. 
> But, I also want to search for 'organic' being adjacent to 
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two 
> positions: (a) In the same position as 'light' and (b) in the same position 
> as 'glows', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets 
> are obviously the same. This works beautifully in Lucene 5.x in both 
> searching and highlighting with span queries. 
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go 
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is 
> being thrown without any comments on why this check is needed. As I explained 
> above, startOffset going backwards is perfectly valid, to deal with word 
> splitting and span operations on these specialized use cases. On the other 
> hand, it is not clear what value is added by this check and which highlighter 
> code is affected by offsets going backwards. This same check is done at 
> BaseTokenStreamTestCase:245. 
> I see others talk about how this check found bugs in WordDelimiter etc. but 
> it also prevents legitimate use cases. Can this check be removed?  






[jira] [Commented] (LUCENE-8756) MLT queries ignore custom term frequencies

2019-04-29 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829556#comment-16829556
 ] 

Michael McCandless commented on LUCENE-8756:


Ahh thanks for the ping [~ollik1] I agree we need to fix this; I'll have a look 
at the PR, thanks!

> MLT queries ignore custom term frequencies
> --
>
> Key: LUCENE-8756
> URL: https://issues.apache.org/jira/browse/LUCENE-8756
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser
>Affects Versions: 7.0, 7.0.1, 7.1, 7.2, 7.2.1, 7.3, 7.4, 7.3.1, 7.5, 7.6, 
> 7.7, 7.7.1, 8.0
>Reporter: Olli Kuonanoja
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The MLT queries ignore any custom term frequencies for the like-texts and 
> uses a hard-coded frequency of 1 per occurrence. I have prepared a test-case 
> to demonstrate the issue and a fix proposal 
> https://github.com/ollik1/lucene-solr/commit/9dbbce2af26698cec1ac82a526d9cee60a880678
>  






[jira] [Commented] (LUCENE-8776) Start offset going backwards has a legitimate purpose

2019-04-24 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825332#comment-16825332
 ] 

Michael McCandless commented on LUCENE-8776:


I think your use case can be properly handled as a token graph, without offsets 
going backwards, if you set proper {{PositionLengthAttribute}} for each token; 
indeed it's for exactly cases like this that we added 
{{PositionLengthAttribute}}.

Give your {{light-emitting-diode}} token {{PositionLengthAttribute=3}} so that 
the consumer of the tokens knows it spans over the three separate tokens 
({{light}}, {{emitting}} and {{diode}}).

To get correct behavior you must do this analysis at query time, and Lucene's 
query parsers will properly interpret the resulting graph and query the index 
correctly.  Unfortunately, you cannot properly index a token graph: Lucene 
discards the {{PositionLengthAttribute}}, which is why, if you really want to 
index a token graph, you should insert a {{FlattenGraphFilter}} at the end of 
your chain.  This still discards information (it loses the graph-ness) but 
tries to do so while minimizing how many queries are broken.
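The token graph above can be modeled in plain Java (this is a simplified model for illustration, not Lucene's attribute API): each token records a position and a position length, so "light-emitting-diode" spans the three single-word tokens without any offset going backwards.

```java
import java.util.Arrays;
import java.util.List;

public class TokenGraphSketch {

  static final class Token {
    final String term;
    final int position;    // graph node where the token starts
    final int posLength;   // how many positions the token spans
    Token(String term, int position, int posLength) {
      this.term = term;
      this.position = position;
      this.posLength = posLength;
    }
    int endPosition() {
      return position + posLength;  // graph node where the token ends
    }
  }

  public static void main(String[] args) {
    // "Organic light-emitting-diode glows" as a token graph:
    List<Token> graph = Arrays.asList(
        new Token("organic", 0, 1),
        new Token("light-emitting-diode", 1, 3),  // spans light+emitting+diode
        new Token("light", 1, 1),
        new Token("emitting", 2, 1),
        new Token("diode", 3, 1),
        new Token("glows", 4, 1));
    // The multi-word token ends exactly where "glows" begins, so a query
    // parser consuming the graph treats it as adjacent to both "organic"
    // and "glows" -- no second copy of the token, no backwards offsets.
    Token led = graph.get(1);
    Token glows = graph.get(5);
    System.out.println(led.endPosition() == glows.position);
  }
}
```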

> Start offset going backwards has a legitimate purpose
> -
>
> Key: LUCENE-8776
> URL: https://issues.apache.org/jira/browse/LUCENE-8776
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 7.6
>Reporter: Ram Venkat
>Priority: Major
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run 
> span queries and highlight them properly. 
> During index time, light-emitting-diode is split into three words, which 
> allows me to search for 'light', 'emitting' and 'diode' individually. The 
> three words occupy adjacent positions in the index, as 'light' adjacent to 
> 'emitting' and 'light' at a distance of two words from 'diode' need to match 
> this word. So, the order of words after splitting are: Organic, light, 
> emitting, diode, glows. 
> But, I also want to search for 'organic' being adjacent to 
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two 
> positions: (a) In the same position as 'light' and (b) in the same position 
> as 'glows', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets 
> are obviously the same. This works beautifully in Lucene 5.x in both 
> searching and highlighting with span queries. 
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go 
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is 
> being thrown without any comments on why this check is needed. As I explained 
> above, startOffset going backwards is perfectly valid, to deal with word 
> splitting and span operations on these specialized use cases. On the other 
> hand, it is not clear what value is added by this check and which highlighter 
> code is affected by offsets going backwards. This same check is done at 
> BaseTokenStreamTestCase:245. 
> I see others talk about how this check found bugs in WordDelimiter etc. but 
> it also prevents legitimate use cases. Can this check be removed?  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-03 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809218#comment-16809218
 ] 

Michael McCandless commented on LUCENE-8753:


I think this is similar to the terms dictionary format Lucene used to have 
before {{BlockTree}}, still in Lucene's sources as {{BlockTermsReader/Writer}}. 
Terms are assigned to fixed sized blocks and only the minimum unique prefix 
needs to be enrolled in the terms index FST.  But being able to do binary 
search within a block is unique!  That's very cool.
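That lookup can be sketched roughly as follows (illustrative Python: a plain sorted list stands in for the FST terms index, and every name here is invented):

```python
# Illustrative sketch, not the actual UniformSplit code: terms are split into
# fixed-size blocks; a small index maps each block's first term ("block key")
# to its block via seek-floor (greatest key <= target); the term is then
# located by binary search *within* the block.
import bisect

def build_blocks(sorted_terms, block_size=4):
    blocks = [sorted_terms[i:i + block_size]
              for i in range(0, len(sorted_terms), block_size)]
    keys = [b[0] for b in blocks]  # stand-in for the FST terms index
    return keys, blocks

def lookup(keys, blocks, term):
    i = bisect.bisect_right(keys, term) - 1  # seek-floor into the index
    if i < 0:
        return False
    block = blocks[i]
    j = bisect.bisect_left(block, term)      # binary search inside the block
    return j < len(block) and block[j] == term

terms = sorted(["apple", "apply", "banana", "band", "bane", "cat", "dog", "door"])
keys, blocks = build_blocks(terms, block_size=3)
assert lookup(keys, blocks, "band")
assert not lookup(keys, blocks, "bicycle")
```

The seek-floor step is why only one key per block needs to live in the index, keeping it far smaller than a full prefix trie.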

It's curious you see gains e.g. for {{AndHighLow}} – are you also doing 
something different to encode/decode postings (not just terms dictionary)?

The 500K docs is a little small – can you post results on the full 
{{wikimediumall}} set of documents?

> New PostingFormat - UniformSplit
> 
>
> Key: LUCENE-8753
> URL: https://issues.apache.org/jira/browse/LUCENE-8753
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.0
>Reporter: Bruno Roustant
>Priority: Major
> Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 
> objectives:
>  - Clear design and simple code.
>  - Easily extensible, for both the logic and the index format.
>  - Light memory usage with a very compact FST.
>  - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (the attached pdf explains the technique visually in more detail)
>  The principle is to split the list of terms into blocks and use a FST to 
> access the block, but not as a prefix trie, rather with a seek-floor pattern. 
> For the selection of the blocks, there is a target average block size (number 
> of terms), with an allowed delta variation (10%) to compare the terms and 
> select the one with the minimal distinguishing prefix.
>  There are also several optimizations inside the block to make it more 
> compact and speed up the loading/scanning.
> The performance obtained is interesting with the luceneutil benchmark, 
> comparing UniformSplit with BlockTree. Find it in the first comment and also 
> attached for better formatting.
> Although the precise percentages vary between runs, three main points:
>  - TermQuery and PhraseQuery are improved.
>  - PrefixQuery and WildcardQuery are ok.
>  - Fuzzy queries are clearly less performant, because BlockTree is so 
> optimized for them.
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time 
> is reduced by 20%. So this PostingsFormat scales to lots of docs, as 
> BlockTree does.
> This initial version passes all Lucene tests. Use “ant test 
> -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.
> Subjectively, we think we have fulfilled our goal of code simplicity. And we 
> have already exercised this PostingsFormat extensibility to create a 
> different flavor for our own use-case.
> Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-03 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809188#comment-16809188
 ] 

Michael McCandless commented on LUCENE-8753:


{quote}I think PKLookup should be disregarded until it's fixed: 
[https://github.com/mikemccand/luceneutil/issues/35] (feel free to comment 
there if people have opinions)
{quote}
Note that the title on that issue was misleading (backwards from the reality) – 
I just corrected it.

I don't think we should disregard {{PKLookup}} results: it's reporting the 
performance when looking up actual IDs that do exist in the index.  That is an 
interesting result, but it is odd you are seeing varying/inconsistent results.

Note that if you add {{-jira}} into the luceneutil benchmark command-line it 
will print results using the markup that Jira displays as a table, making it 
easier for everyone to read.

> New PostingFormat - UniformSplit
> 
>
> Key: LUCENE-8753
> URL: https://issues.apache.org/jira/browse/LUCENE-8753
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.0
>Reporter: Bruno Roustant
>Priority: Major
> Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 
> objectives:
>  - Clear design and simple code.
>  - Easily extensible, for both the logic and the index format.
>  - Light memory usage with a very compact FST.
>  - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (the attached pdf explains the technique visually in more detail)
>  The principle is to split the list of terms into blocks and use a FST to 
> access the block, but not as a prefix trie, rather with a seek-floor pattern. 
> For the selection of the blocks, there is a target average block size (number 
> of terms), with an allowed delta variation (10%) to compare the terms and 
> select the one with the minimal distinguishing prefix.
>  There are also several optimizations inside the block to make it more 
> compact and speed up the loading/scanning.
> The performance obtained is interesting with the luceneutil benchmark, 
> comparing UniformSplit with BlockTree. Find it in the first comment and also 
> attached for better formatting.
> Although the precise percentages vary between runs, three main points:
>  - TermQuery and PhraseQuery are improved.
>  - PrefixQuery and WildcardQuery are ok.
>  - Fuzzy queries are clearly less performant, because BlockTree is so 
> optimized for them.
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time 
> is reduced by 20%. So this PostingsFormat scales to lots of docs, as 
> BlockTree does.
> This initial version passes all Lucene tests. Use “ant test 
> -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.
> Subjectively, we think we have fulfilled our goal of code simplicity. And we 
> have already exercised this PostingsFormat extensibility to create a 
> different flavor for our own use-case.
> Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8740) AssertionError FlattenGraphFilter

2019-03-28 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804188#comment-16804188
 ] 

Michael McCandless commented on LUCENE-8740:


Maybe a dup of https://issues.apache.org/jira/browse/LUCENE-8723?

 

> AssertionError FlattenGraphFilter
> -
>
> Key: LUCENE-8740
> URL: https://issues.apache.org/jira/browse/LUCENE-8740
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 7.5, 8.0
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 8.1, master (9.0)
>
> Attachments: LUCENE-8740.patch
>
>
> Our unit tests picked up an unusual AssertionError in FlattenGraphFilter 
> which manifests itself only in very specific circumstances involving 
> WordDelimiterGraph, StopFilter, FlattenGraphFilter and MinhashFilter.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8150) Remove references to segments.gen.

2019-03-19 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16796199#comment-16796199
 ] 

Michael McCandless commented on LUCENE-8150:


+1

> Remove references to segments.gen.
> --
>
> Key: LUCENE-8150
> URL: https://issues.apache.org/jira/browse/LUCENE-8150
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.1, master (9.0)
>
> Attachments: LUCENE-8150.patch, LUCENE-8150.patch
>
>
> This was the way we wrote pending segment files before we switched to 
> {{pending_segments_N}} in LUCENE-5925.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8150) Remove references to segments.gen.

2019-03-17 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794558#comment-16794558
 ] 

Michael McCandless commented on LUCENE-8150:


Hi [~jpountz], I fixed one issue with 
[http://jirasearch.mikemccandless.com|http://jirasearch.mikemccandless.com/], 
namely that it was incorrectly using strike-through for the issue id for issues 
that had "Status: PATCH AVAILABLE".  For example, this issue is no longer 
rendered with strikethrough.

Note that you can drill down on two different status if you hold the shift key 
when clicking on them; for example here are all issues that are Open, Reopened 
or Patch Available: 
[http://jirasearch.mikemccandless.com/search.py?chg=ddm&text=&a1=status&a2=Patch+Available&sort=recentlyUpdated&format=list&dd=project%3ALucene&dd=status%3AOpen%2CReopened]

 

> Remove references to segments.gen.
> --
>
> Key: LUCENE-8150
> URL: https://issues.apache.org/jira/browse/LUCENE-8150
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.1, master (9.0)
>
> Attachments: LUCENE-8150.patch
>
>
> This was the way we wrote pending segment files before we switched to 
> {{pending_segments_N}} in LUCENE-5925.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8150) Remove references to segments.gen.

2019-03-15 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793599#comment-16793599
 ] 

Michael McCandless commented on LUCENE-8150:


{quote}I think it's due to the fact that I'm always filtering by open issues on 
jirasearch, and it filters out issues that are marked as "patch available"
{quote}
Oh no!  Sorry :)  I will try to fix this.  Clearly 
[http://jirasearch.mikemccandless.com|http://jirasearch.mikemccandless.com/] is 
buggy here ... it seems to think issues that have patches are resolved?

> Remove references to segments.gen.
> --
>
> Key: LUCENE-8150
> URL: https://issues.apache.org/jira/browse/LUCENE-8150
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.1, master (9.0)
>
> Attachments: LUCENE-8150.patch
>
>
> This was the way we wrote pending segment files before we switched to 
> {{pending_segments_N}} in LUCENE-5925.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8150) Remove references to segments.gen.

2019-03-15 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793595#comment-16793595
 ] 

Michael McCandless commented on LUCENE-8150:


Hmm it looks like this was never committed?

> Remove references to segments.gen.
> --
>
> Key: LUCENE-8150
> URL: https://issues.apache.org/jira/browse/LUCENE-8150
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.1, master (9.0)
>
> Attachments: LUCENE-8150.patch
>
>
> This was the way we wrote pending segment files before we switched to 
> {{pending_segments_N}} in LUCENE-5925.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8542) Provide the LeafSlice to CollectorManager.newCollector to save memory on small index slices

2019-03-13 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16791706#comment-16791706
 ] 

Michael McCandless commented on LUCENE-8542:


+1 to improve slices() to aggregate small slices together by default; that's 
what we are doing in our production service – we combine up to 5 segments, up 
to 250K docs in aggregate.
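A greedy packing of that kind might look like this (illustrative Python; the caps of 5 segments / 250K docs are the numbers mentioned above, everything else is invented):

```python
# Illustrative sketch, not Lucene's IndexSearcher.slices(): greedily pack
# segments into slices, capping each slice at max_segments segments and
# max_docs documents in aggregate.
def make_slices(segment_doc_counts, max_segments=5, max_docs=250_000):
    slices, current, docs = [], [], 0
    # Packing largest-first keeps big segments from dragging small ones along.
    for n in sorted(segment_doc_counts, reverse=True):
        if current and (len(current) >= max_segments or docs + n > max_docs):
            slices.append(current)
            current, docs = [], 0
        current.append(n)
        docs += n
    if current:
        slices.append(current)
    return slices

slices = make_slices([400_000, 120_000, 90_000, 30_000, 5_000, 4_000, 1_000])
# The 400K segment gets its own slice; the small segments are combined,
# so far fewer per-slice collectors (and threads) are needed.
assert all(len(s) <= 5 for s in slices)
```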

> Provide the LeafSlice to CollectorManager.newCollector to save memory on 
> small index slices
> ---
>
> Key: LUCENE-8542
> URL: https://issues.apache.org/jira/browse/LUCENE-8542
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Christoph Kaser
>Priority: Minor
> Attachments: LUCENE-8542.patch
>
>
> I have an index consisting of 44 million documents spread across 60 segments. 
> When I run a query against this index with a huge number of results requested 
> (e.g. 5 million), this query uses more than 5 GB of heap if the IndexSearcher 
> was configured to use an ExecutorService.
> (I know this kind of query is fairly unusual and it would be better to use 
> paging and searchAfter, but our architecture does not allow this at the 
> moment.)
> The reason for the huge memory requirement is that the search [will create a 
> TopScoreDocCollector for each 
> segment|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L404],
>  each one with numHits = 5 million. This is fine for the large segments, but 
> many of those segments are fairly small and only contain several thousand 
> documents. This wastes a huge amount of memory for queries with large values 
> of numHits on indices with many segments.
> Therefore, I propose to change the CollectorManager - interface in the 
> following way:
>  * change the method newCollector to accept a parameter LeafSlice that can be 
> used to determine the total count of documents in the LeafSlice
>  * Maybe, in order to remain backwards compatible, it would be possible to 
> introduce this as a new method with a default implementation that calls the 
> old method - otherwise, it probably has to wait for Lucene 8?
>  * This can then be used to cap numHits for each TopScoreDocCollector to the 
> leafslice-size.
> If this is something that would make sense for you, I can try to provide a 
> patch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8720) Integer overflow bug in NameIntCacheLRU.makeRoomLRU()

2019-03-12 Thread Michael McCandless (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-8720.

   Resolution: Fixed
Fix Version/s: (was: 7.1.1)
   8.1
   master (9.0)

> Integer overflow bug in NameIntCacheLRU.makeRoomLRU()
> -
>
> Key: LUCENE-8720
> URL: https://issues.apache.org/jira/browse/LUCENE-8720
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 7.7.1
> Environment: Mac OS X 10.11.6, though the environment is irrelevant because 
> this is a straightforward integer overflow bug.
>Reporter: Russell A Brown
>Priority: Major
>  Labels: easyfix, patch
> Fix For: master (9.0), 8.1
>
> Attachments: LUCENE-.patch
>
>
> The NameIntCacheLRU.makeRoomLRU() method has an integer overflow bug because 
> if maxCacheSize >= Integer.MAX_VALUE/2, 2*maxCacheSize will overflow to 
> -(2^30) and the value of n will overflow to a negative integer as well, which 
> will prevent any clearing of the cache whatsoever. Hence, performance will 
> degrade once the cache becomes full because it will be impossible to remove 
> any entries in order to add new entries to the cache.
> Moreover, comments in NameIntCacheLRU.java and LruTaxonomyWriterCache.java 
> indicate that 2/3 of the cache will be cleared, whereas in fact only 1/3 of 
> the cache is cleared. So as not to change the behavior of the 
> NameIntCacheLRU.makeRoomLRU() method, I have not changed the code to clear 
> 2/3 of the cache but instead I have changed the comments to indicate that 1/3 
> of the cache is cleared.
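The wraparound itself is easy to demonstrate (illustrative Python that wraps values to 32 bits to mimic Java int arithmetic; the exact expression used in NameIntCacheLRU may differ, this only shows the mechanism):

```python
# Mimic Java's signed 32-bit int semantics, which Python ints don't have.
def java_int(x):
    """Wrap to signed 32-bit, like Java int arithmetic."""
    return ((x + 2**31) % 2**32) - 2**31

INT_MAX = 2**31 - 1
max_cache_size = 2**30            # a large (legal) cache size
n = java_int(2 * max_cache_size)  # 2*maxCacheSize as Java would compute it
# The doubled value wraps negative, so any "entries to remove" count derived
# from it goes negative too, and the cache is never trimmed.
assert n < 0
```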



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8717) Handle stop words that appear at articulation points

2019-03-12 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790698#comment-16790698
 ] 

Michael McCandless commented on LUCENE-8717:


+1 for {{TermDeletedAttribute}}.

Are we also fixing {{StopFilter}} to set {{TermDeletedAttribute}}?  Would this 
mean that a {{SynonymFilter}} trying to match a synonym containing a stop word 
would now match even when {{StopFilter}} before it marked the token deleted?
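The intended interaction can be sketched like this (illustrative Python, not the TokenStream attribute API; all names are invented): a stopword is marked deleted rather than removed, so a later synonym stage can still match the full phrase, and deleted tokens are dropped only at the end of the chain:

```python
# Each token is (term, deleted); "deleted" plays the role of the proposed
# TermDeletedAttribute.
def stop_filter(tokens, stopwords):
    # Mark stopwords deleted instead of removing them from the stream.
    return [(t, deleted or t in stopwords) for t, deleted in tokens]

def synonym_filter(tokens, phrase, replacement):
    terms = [t for t, _ in tokens]          # deleted tokens still visible here
    n = len(phrase)
    out, i = [], 0
    while i <= len(terms) - n:
        if terms[i:i + n] == phrase:
            out.append((replacement, False))  # collapse the matched phrase
            i += n
        else:
            out.append(tokens[i])
            i += 1
    out.extend(tokens[i:])
    return out

tokens = [(t, False) for t in "watch the walking dead tonight".split()]
tokens = stop_filter(tokens, {"the"})
tokens = synonym_filter(tokens, ["the", "walking", "dead"], "twd")
final = [t for t, deleted in tokens if not deleted]  # drop deleted at the end
assert final == ["watch", "twd", "tonight"]
```

With plain removal, the synonym stage would only ever see "walking dead" and the mapping could never fire.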

> Handle stop words that appear at articulation points
> 
>
> Key: LUCENE-8717
> URL: https://issues.apache.org/jira/browse/LUCENE-8717
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8717.patch, LUCENE-8717.patch
>
>
> Our set of TokenFilters currently cannot handle the case where a multi-term 
> synonym starts with a stopword.  This means that given a synonym file 
> containing the mapping "the walking dead => twd" and a standard english 
> stopword filter, QueryBuilder will produce incorrect queries.
> The tricky part here is that our standard way of dealing with stopwords, 
> which is to just remove them entirely from the token stream and use a larger 
> position increment on subsequent tokens, doesn't work when the removed token 
> also has a position length greater than 1.  There are various tricks you can 
> do to increment position length on the previous token, but this doesn't work 
> if the stopword is the first token in the token stream, or if there are 
> multiple stopwords in the side path.
> Instead, I'd like to propose adding a new TermDeletedAttribute, which we only 
> use on tokens that should be removed from the stream but which hold necessary 
> information about the structure of the token graph.  These tokens can then be 
> removed by GraphTokenStreamFiniteStrings at query time, and by 
> FlattenGraphFilter at index time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8692) IndexWriter.getTragicException() may not reflect all corrupting exceptions (notably: NoSuchFileException)

2019-03-12 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790690#comment-16790690
 ] 

Michael McCandless commented on LUCENE-8692:


{{rollback}} gives you a way to close {{IndexWriter}} without doing a commit, 
which seems useful.  If you removed that, what would users do instead?

> IndexWriter.getTragicException() may not reflect all corrupting exceptions 
> (notably: NoSuchFileException)
> -
>
> Key: LUCENE-8692
> URL: https://issues.apache.org/jira/browse/LUCENE-8692
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Hoss Man
>Priority: Major
> Attachments: LUCENE-8692.patch, LUCENE-8692.patch, LUCENE-8692.patch, 
> LUCENE-8692_test.patch
>
>
> Backstory...
> Solr has a "LeaderTragicEventTest" which uses MockDirectoryWrapper's 
> {{corruptFiles}} to introduce corruption into the "leader" node's index and 
> then assert that this solr node gives up its leadership of the shard and 
> another replica takes over.
> This can currently fail sporadically (but usually reproducibly - see 
> SOLR-13237) due to the leader not giving up its leadership even after the 
> corruption causes an update/commit to fail. Solr's leadership code makes this 
> decision after encountering an exception from the IndexWriter based on whether 
> {{IndexWriter.getTragicException()}} is (non-)null.
> 
> While investigating this, I created an isolated Lucene-Core equivalent test 
> that demonstrates the same basic situation:
>  * Gradually cause corruption on an index until (otherwise) valid execution 
> of IW.add() + IW.commit() calls throw an exception to the IW client.
>  * assert that if an exception is thrown to the IW client, 
> {{getTragicException()}} is now non-null.
> It's fairly easy to make my new test fail reproducibly – in every situation 
> I've seen the underlying exception is a {{NoSuchFileException}} (ie: the 
> randomly introduced corruption was to delete some file).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8720) Integer overflow bug in NameIntCacheLRU.makeRoomLRU()

2019-03-12 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790681#comment-16790681
 ] 

Michael McCandless commented on LUCENE-8720:


Thanks [~kirigirisu], nice catch – I'll pass tests and push soon.

> Integer overflow bug in NameIntCacheLRU.makeRoomLRU()
> -
>
> Key: LUCENE-8720
> URL: https://issues.apache.org/jira/browse/LUCENE-8720
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 7.7.1
> Environment: Mac OS X 10.11.6, though the environment is irrelevant because 
> this is a straightforward integer overflow bug.
>Reporter: Russell A Brown
>Priority: Major
>  Labels: easyfix, patch
> Fix For: 7.1.1
>
> Attachments: LUCENE-.patch
>
>
> The NameIntCacheLRU.makeRoomLRU() method has an integer overflow bug because 
> if maxCacheSize >= Integer.MAX_VALUE/2, 2*maxCacheSize will overflow to 
> -(2^30) and the value of n will overflow to a negative integer as well, which 
> will prevent any clearing of the cache whatsoever. Hence, performance will 
> degrade once the cache becomes full because it will be impossible to remove 
> any entries in order to add new entries to the cache.
> Moreover, comments in NameIntCacheLRU.java and LruTaxonomyWriterCache.java 
> indicate that 2/3 of the cache will be cleared, whereas in fact only 1/3 of 
> the cache is cleared. So as not to change the behavior of the 
> NameIntCacheLRU.makeRoomLRU() method, I have not changed the code to clear 
> 2/3 of the cache but instead I have changed the comments to indicate that 1/3 
> of the cache is cleared.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8542) Provide the LeafSlice to CollectorManager.newCollector to save memory on small index slices

2019-03-12 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790677#comment-16790677
 ] 

Michael McCandless commented on LUCENE-8542:


Maybe we should try swapping in the JDK's {{PriorityQueue}} and measure if this 
really hurts search throughput?

> Provide the LeafSlice to CollectorManager.newCollector to save memory on 
> small index slices
> ---
>
> Key: LUCENE-8542
> URL: https://issues.apache.org/jira/browse/LUCENE-8542
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Christoph Kaser
>Priority: Minor
> Attachments: LUCENE-8542.patch
>
>
> I have an index consisting of 44 million documents spread across 60 segments. 
> When I run a query against this index with a huge number of results requested 
> (e.g. 5 million), this query uses more than 5 GB of heap if the IndexSearcher 
> was configured to use an ExecutorService.
> (I know this kind of query is fairly unusual and it would be better to use 
> paging and searchAfter, but our architecture does not allow this at the 
> moment.)
> The reason for the huge memory requirement is that the search [will create a 
> TopScoreDocCollector for each 
> segment|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L404],
>  each one with numHits = 5 million. This is fine for the large segments, but 
> many of those segments are fairly small and only contain several thousand 
> documents. This wastes a huge amount of memory for queries with large values 
> of numHits on indices with many segments.
> Therefore, I propose to change the CollectorManager - interface in the 
> following way:
>  * change the method newCollector to accept a parameter LeafSlice that can be 
> used to determine the total count of documents in the LeafSlice
>  * Maybe, in order to remain backwards compatible, it would be possible to 
> introduce this as a new method with a default implementation that calls the 
> old method - otherwise, it probably has to wait for Lucene 8?
>  * This can then be used to cap numHits for each TopScoreDocCollector to the 
> leafslice-size.
> If this is something that would make sense for you, I can try to provide a 
> patch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8542) Provide the LeafSlice to CollectorManager.newCollector to save memory on small index slices

2019-03-12 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790576#comment-16790576
 ] 

Michael McCandless commented on LUCENE-8542:


I think the core API change is quite minor and reasonable – letting the 
{{CollectorManager.newCollector}} know which segments (slice) it will collect?  E.g. 
we already pass the {{LeafReaderContext}} to {{Collector.getLeafCollector}} so 
it's informed about the details of which segment it's about to collect.

 

I agree the motivating use case here is somewhat abusive, and a custom 
Collector is probably needed anyway, but I think this API change could help 
non-abusive cases too.

Alternatively we could explore fixing our default top hits collectors to not 
pre-allocate the full topN for every slice ... that is really unexpected 
behavior, and users have tripped up on this multiple times in the past causing 
us to make some partial fixes for it.
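The memory saving being discussed can be sketched as follows (illustrative Python; the class and parameter names are invented):

```python
# If newCollector() knew its slice's doc count, each per-slice top-N collector
# could pre-allocate min(numHits, sliceDocCount) entries instead of numHits:
# a slice can never return more hits than it has documents.
import heapq

class SliceTopN:
    def __init__(self, num_hits, slice_doc_count):
        # Cap the priority-queue capacity at the slice size.
        self.capacity = min(num_hits, slice_doc_count)
        self.heap = []  # min-heap of (score, doc)

    def collect(self, doc, score):
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, (score, doc))
        elif score > self.heap[0][0]:
            heapq.heapreplace(self.heap, (score, doc))

num_hits = 5_000_000
small_slice = SliceTopN(num_hits, slice_doc_count=3_000)
assert small_slice.capacity == 3_000   # not 5,000,000 queue entries
```

For the 60-segment / numHits=5M scenario in the description, capping by slice size is what turns 60 five-million-entry queues into queues sized to each slice.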

> Provide the LeafSlice to CollectorManager.newCollector to save memory on 
> small index slices
> ---
>
> Key: LUCENE-8542
> URL: https://issues.apache.org/jira/browse/LUCENE-8542
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Christoph Kaser
>Priority: Minor
> Attachments: LUCENE-8542.patch
>
>
> I have an index consisting of 44 million documents spread across 60 segments. 
> When I run a query against this index with a huge number of results requested 
> (e.g. 5 million), this query uses more than 5 GB of heap if the IndexSearcher 
> was configured to use an ExecutorService.
> (I know this kind of query is fairly unusual and it would be better to use 
> paging and searchAfter, but our architecture does not allow this at the 
> moment.)
> The reason for the huge memory requirement is that the search [will create a 
> TopScoreDocCollector for each 
> segment|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L404],
>  each one with numHits = 5 million. This is fine for the large segments, but 
> many of those segments are fairly small and only contain several thousand 
> documents. This wastes a huge amount of memory for queries with large values 
> of numHits on indices with many segments.
> Therefore, I propose to change the CollectorManager - interface in the 
> following way:
>  * change the method newCollector to accept a parameter LeafSlice that can be 
> used to determine the total count of documents in the LeafSlice
>  * Maybe, in order to remain backwards compatible, it would be possible to 
> introduce this as a new method with a default implementation that calls the 
> old method - otherwise, it probably has to wait for Lucene 8?
>  * This can then be used to cap numHits for each TopScoreDocCollector to the 
> leafslice-size.
> If this is something that would make sense for you, I can try to provide a 
> patch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8216) Better cross-field scoring

2019-02-23 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775884#comment-16775884
 ] 

Michael McCandless commented on LUCENE-8216:


Can this be resolved now?  Looks like [~jim.ferenczi] pushed the new query to 
sandbox?

> Better cross-field scoring
> --
>
> Key: LUCENE-8216
> URL: https://issues.apache.org/jira/browse/LUCENE-8216
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Major
> Fix For: 8.0
>
> Attachments: LUCENE-8216.patch, LUCENE-8216.patch
>
>
> I'd like Lucene to have better support for scoring across multiple fields. 
> Today we have BlendedTermQuery which tries to help there but it probably 
> tries to do too much on some aspects (handling cross-field term queries AND 
> synonyms) and too little on other ones (it tries to merge index-level 
> statistics, but not per-document statistics like tf and norm).
> Maybe we could implement something like BM25F so that queries across multiple 
> fields would retain the benefits of BM25 like the fact that the impact of the 
> term frequency saturates quickly, which is not the case with BlendedTermQuery 
> if you have occurrences across many fields.






[jira] [Commented] (LUCENE-8703) Build point writers only when needed on the BKD tree

2019-02-22 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775181#comment-16775181
 ] 

Michael McCandless commented on LUCENE-8703:


It'd be nice to have a metric in our nightly points benchmarks measuring how 
much heap was required while building the index.

> Build point writers only when needed on the BKD tree
> 
>
> Key: LUCENE-8703
> URL: https://issues.apache.org/jira/browse/LUCENE-8703
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
> Attachments: LUCENE-8703.patch, LUCENE-8703.patch, LUCENE-8703.patch
>
>
> With the introduction of LUCENE-8699, I have realised the BKD tree uses quite 
> a lot of heap even when it is not needed, for example for 1D points. 
> In this issue I propose to create point writers only when needed. In addition 
> I propose to create PointWriters based on the estimated point count given in 
> the constructor. 






[jira] [Commented] (LUCENE-8671) Add setting for moving FST offheap/onheap

2019-02-22 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775190#comment-16775190
 ] 

Michael McCandless commented on LUCENE-8671:


[~akjain] that's true, maybe we don't need per-field control and a single 
boolean option would work?  We could maybe add a setter on 
{{BlockTreeTermsWriter}}?  And it'd write that setting into the index, and 
{{BlockTreeTermsReader}} would read that and then load FSTs on or off heap.
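The write-the-setting/read-the-setting round trip suggested here can be pictured with a small sketch; plain java.io stands in for Lucene's IndexOutput/IndexInput, and all names are hypothetical:

```java
import java.io.*;

// Sketch: a writer records an "FST off-heap" flag in the terms metadata,
// and the reader consults it when deciding how to load the FST.
// Plain java.io stands in for Lucene's IndexOutput/IndexInput.
public class FstLoadModeDemo {

    static byte[] writeTermsMeta(boolean fstOffHeap) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeByte(fstOffHeap ? 1 : 0); // setting written at index time
        }
        return bytes.toByteArray();
    }

    static String openFst(byte[] termsMeta) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(termsMeta))) {
            boolean offHeap = in.readByte() == 1;
            // A real reader would choose an on-heap vs. mmap-backed byte
            // store here; we just report the decision.
            return offHeap ? "off-heap" : "on-heap";
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(openFst(writeTermsMeta(true)));
        System.out.println(openFst(writeTermsMeta(false)));
    }
}
```

The tradeoff noted above follows directly: the flag is baked in at write time, so switching modes requires rewriting the segment.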

> Add setting for moving FST offheap/onheap
> -
>
> Key: LUCENE-8671
> URL: https://issues.apache.org/jira/browse/LUCENE-8671
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/FSTs, core/store
>Reporter: Ankit Jain
>Priority: Minor
> Attachments: offheap_generic_settings.patch, offheap_settings.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> While LUCENE-8635 adds support for loading FST offheap using mmap, users do 
> not have the flexibility to specify fields for which the FST needs to be 
> offheap. Adding that would allow users to tune heap usage as per their workload.
> The ideal way would be to add an attribute to FieldInfo, where we have 
> put/getAttribute. Then FieldReader can inspect the FieldInfo and pass the 
> appropriate On/OffHeapStore when creating its FST. It can support special 
> keywords like ALL/NONE.






[jira] [Resolved] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap

2019-02-19 Thread Michael McCandless (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-8635.

   Resolution: Fixed
Fix Version/s: master (9.0)
   8.x
   8.0

Thanks [~akjain]!

> Lazy loading Lucene FST offheap using mmap
> --
>
> Key: LUCENE-8635
> URL: https://issues.apache.org/jira/browse/LUCENE-8635
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/FSTs
> Environment: I used below setup for es_rally tests:
> single node i3.xlarge running ES 6.5
> es_rally was running on another i3.xlarge instance
>Reporter: Ankit Jain
>Priority: Major
> Fix For: 8.0, 8.x, master (9.0)
>
> Attachments: fst-offheap-ra-rev.patch, fst-offheap-rev.patch, 
> offheap.patch, optional_offheap_ra.patch, ra.patch, rally_benchmark.xlsx
>
>
> Currently, FST loads all the terms into heap memory during index open. This 
> causes frequent JVM OOM issues if the term size gets big. A better way of 
> doing this would be to lazily load the FST using mmap. That ensures only the 
> required terms get loaded into memory.
>  
> Lucene can expose an API for providing a list of fields to load terms offheap. I'm 
> planning to take the following approach for this:
>  # Add a boolean property fstOffHeap in FieldInfo
>  # Pass list of offheap fields to lucene during index open (ALL can be 
> special keyword for loading ALL fields offheap)
>  # Initialize the fstOffHeap property during lucene index open
>  # FieldReader invokes default FST constructor or OffHeap constructor based 
> on fstOffHeap field
>  
> I created a patch (that loads all fields offheap), did some benchmarks using 
> es_rally and results look good.
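The lazy-loading mechanism described above can be sketched with the JDK's own mmap support. This is an illustration of the idea, not Lucene's actual off-heap store code:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of lazy loading: mmap the FST bytes instead of copying them onto
// the heap, so pages are faulted in only when a term lookup touches them.
public class MmapFstDemo {

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("fst", ".bin");
        try {
            Files.write(file, new byte[] {10, 20, 30, 40});
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                // No heap copy of the whole file: the OS pages data in on demand.
                MappedByteBuffer fstBytes =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                // Only the page holding offset 2 needs to be resident.
                System.out.println(fstBytes.get(2));
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```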






[jira] [Commented] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap

2019-02-19 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772153#comment-16772153
 ] 

Michael McCandless commented on LUCENE-8635:


I ran luceneutil on {{wikimediumall}} with current trunk vs PR here – net/net 
looks like noise, which is great – I'll push shortly:
{noformat}
Report after iter 19:

                    Task    QPS base      StdDev    QPS comp      StdDev    Pct diff

                 Prefix3       37.05     (11.4%)       36.25     (13.0%)   -2.1% ( -23% -   25%)
   BrowseMonthSSDVFacets        5.01      (6.4%)        4.91     (10.4%)   -1.9% ( -17% -   15%)
   BrowseMonthTaxoFacets        1.24      (2.7%)        1.22      (4.8%)   -1.3% (  -8% -    6%)
                Wildcard      106.53      (8.6%)      105.18      (9.1%)   -1.3% ( -17% -   18%)
   HighTermDayOfYearSort       14.85      (4.2%)       14.70      (4.2%)   -1.0% (  -9% -    7%)
    BrowseDateTaxoFacets        1.11      (3.2%)        1.10      (5.6%)   -0.8% (  -9% -    8%)
BrowseDayOfYearTaxoFacets        1.11      (3.1%)        1.10      (5.6%)   -0.8% (  -9% -    8%)
         MedSloppyPhrase        4.59      (3.4%)        4.56      (2.8%)   -0.5% (  -6% -    5%)
                  Fuzzy2       68.49      (1.0%)       68.12      (1.3%)   -0.5% (  -2% -    1%)
             LowSpanNear       30.34      (1.7%)       30.19      (1.9%)   -0.5% (  -4% -    3%)
                  Fuzzy1       72.43      (0.9%)       72.10      (1.4%)   -0.5% (  -2% -    1%)
               LowPhrase       34.35      (1.1%)       34.22      (2.0%)   -0.4% (  -3% -    2%)
                 Respell       47.66      (1.4%)       47.48      (1.7%)   -0.4% (  -3% -    2%)
         LowSloppyPhrase       10.59      (4.9%)       10.56      (3.6%)   -0.3% (  -8% -    8%)
                HighTerm     1290.39      (1.8%)     1286.15      (1.4%)   -0.3% (  -3% -    2%)
                 MedTerm     1419.25      (2.0%)     1415.23      (1.5%)   -0.3% (  -3% -    3%)
                  IntNRQ       27.03     (11.0%)       26.96     (10.9%)   -0.3% ( -19% -   24%)
        HighSloppyPhrase        6.73      (4.9%)        6.71      (3.4%)   -0.3% (  -8% -    8%)
           OrNotHighHigh      825.79      (1.9%)      823.77      (1.4%)   -0.2% (  -3% -    3%)
            OrNotHighMed      912.80      (1.3%)      910.96      (1.3%)   -0.2% (  -2% -    2%)
               MedPhrase       29.52      (1.1%)       29.46      (1.9%)   -0.2% (  -3% -    2%)
            OrHighNotLow     1184.54      (3.1%)     1182.86      (1.8%)   -0.1% (  -4% -    4%)
                 LowTerm      974.30      (1.5%)      973.33      (1.4%)   -0.1% (  -2% -    2%)
               OrHighLow      328.39      (1.0%)      328.13      (1.0%)   -0.1% (  -2% -    1%)
             AndHighHigh       21.04      (2.8%)       21.03      (2.6%)   -0.1% (  -5% -    5%)
           OrHighNotHigh      907.78      (1.8%)      907.93      (1.4%)    0.0% (  -3% -    3%)
            OrHighNotMed     1019.49      (2.0%)     1019.67      (1.4%)    0.0% (  -3% -    3%)
              AndHighMed       64.27      (1.1%)       64.33      (1.1%)    0.1% (  -2% -    2%)
            OrNotHighLow      414.78      (1.2%)      415.43      (1.0%)    0.2% (  -2% -    2%)
BrowseDayOfYearSSDVFacets        4.14      (6.9%)        4.15      (8.9%)    0.2% ( -14% -   17%)
              AndHighLow      371.09      (1.7%)      371.84      (1.7%)    0.2% (  -3% -    3%)
               OrHighMed       65.31      (1.8%)       65.45      (1.8%)    0.2% (  -3% -    3%)
                PKLookup      141.21      (1.6%)      141.63      (1.9%)    0.3% (  -3% -    3%)
            HighSpanNear       25.84      (2.8%)       25.94      (2.6%)    0.4% (  -4% -    5%)
             MedSpanNear       26.39      (2.9%)       26.50      (2.8%)    0.4% (  -5% -    6%)
              HighPhrase       11.72      (2.1%)       11.77      (1.9%)    0.4% (  -3% -    4%)
              OrHighHigh       14.60      (2.2%)       14.69      (1.8%)    0.6% (  -3% -    4%)
       HighTermMonthSort       31.51      (6.0%)       31.90      (6.0%)    1.2% ( -10% -   14%)
{noformat}

> Lazy loading Lucene FST offheap using mmap
> --
>
> Key: LUCENE-8635
> URL: https://issues.apache.org/jira/browse/LUCENE-8635
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/FSTs
> Environment: I used below setup for es_rally tests:
> single node i3.xlarge running ES 6.5
> es_rally was running on another i3.xlarge instance
>Reporter: Ankit Jain
>Priority: Major
> Attachments: fst-offheap-ra-rev.patch, fst-offheap-rev.patch, 
> offheap.patch, optional_offheap_ra.patch, ra.patch, rally_benchmark.xlsx
>
>
> Currently, FST loads all the terms into heap memory 

[jira] [Commented] (LUCENE-8671) Add setting for moving FST offheap/onheap

2019-02-19 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772068#comment-16772068
 ] 

Michael McCandless commented on LUCENE-8671:


Actually I think this is a good use case for the existing attributes in 
{{FieldInfo}} – this sort of extensibility is exactly why we have attributes.

But can you use the existing {{attributes}} instead of adding a new 
{{readerAttributes}}?  And could we make this something a custom {{Codec}} impl 
would set?  Then we shouldn't need any changes to {{FieldInfo.java}}, 
{{IndexWriter.java}}, {{LiveIndexWriterConfig.java}}, etc.  We'd just make a 
custom codec setting this attribute for fields where we want to override 
Lucene's ({{BlockTreeTermReader}}'s) default behavior.  Yes, it'd mean one must 
commit at indexing time as to which fields will be on vs off heap at search 
time, but I think that's an OK tradeoff?
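The attribute-based approach described above can be sketched with a plain map standing in for FieldInfo's attributes; the attribute key and helper names are invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the FieldInfo-attribute idea: a custom codec sets a per-field
// attribute at write time, and the terms reader consults it at open time.
// The key "fst.offheap" and the Map stand-in for FieldInfo are illustrative.
public class FieldAttributeDemo {

    static final String FST_OFF_HEAP_KEY = "fst.offheap"; // hypothetical constant

    // Stand-in for FieldInfo#putAttribute / #getAttribute, per field.
    static final Map<String, Map<String, String>> fieldAttributes = new HashMap<>();

    // Write time: the custom codec records the choice for this field.
    static void writeField(String field, boolean offHeap) {
        fieldAttributes.computeIfAbsent(field, f -> new HashMap<>())
                       .put(FST_OFF_HEAP_KEY, Boolean.toString(offHeap));
    }

    // Read time: default to on-heap when the attribute is absent.
    static boolean loadOffHeap(String field) {
        return Boolean.parseBoolean(
            fieldAttributes.getOrDefault(field, Map.of())
                           .getOrDefault(FST_OFF_HEAP_KEY, "false"));
    }

    public static void main(String[] args) {
        writeField("body", true);  // large field: keep FST off heap
        writeField("id", false);   // primary key: keep FST on heap for fast lookups
        System.out.println(loadOffHeap("body"));
        System.out.println(loadOffHeap("id"));
    }
}
```

As noted above, because the attribute is written into the segment, no changes to FieldInfo.java or IndexWriter.java are needed; the policy lives entirely in the codec.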

> Add setting for moving FST offheap/onheap
> -
>
> Key: LUCENE-8671
> URL: https://issues.apache.org/jira/browse/LUCENE-8671
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/FSTs, core/store
>Reporter: Ankit Jain
>Priority: Minor
> Attachments: offheap_generic_settings.patch, offheap_settings.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> While LUCENE-8635 adds support for loading FST offheap using mmap, users do 
> not have the flexibility to specify fields for which the FST needs to be 
> offheap. Adding that would allow users to tune heap usage as per their workload.
> The ideal way would be to add an attribute to FieldInfo, where we have 
> put/getAttribute. Then FieldReader can inspect the FieldInfo and pass the 
> appropriate On/OffHeapStore when creating its FST. It can support special 
> keywords like ALL/NONE.






[jira] [Commented] (LUCENE-8675) Divide Segment Search Amongst Multiple Threads

2019-02-03 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16759440#comment-16759440
 ] 

Michael McCandless commented on LUCENE-8675:


{quote}If some segments are getting large enough that intra-segment parallelism 
becomes appealing, then maybe an easier and more efficient way to increase 
parallelism is to instead reduce the maximum segment size so that inter-segment 
parallelism has more potential for parallelizing query execution.
{quote}
Yeah that is a good workaround given how Lucene works today.

It's essentially the same as your original suggestion ("make more shards and 
search them concurrently"), just at the segment instead of shard level.

But this still adds some costs -- the per-segment fixed cost for each query. 
That cost should be less than the per shard fixed cost in the sharded case, but 
it's still adding some cost.

If instead Lucene had a way to divide large segments into multiple work units 
(and I agree there are challenges with that! -- not just BKD and multi-term 
queries, but e.g. how would early termination work?) then we could pay that 
per-segment fixed cost once for such segments then let multiple threads share 
the variable cost work of finding and ranking hits.

In our recently launched production index we see sizable jumps in the P99+ 
query latencies when a large segment merges finish and replicate, because we 
are using "thread per segment" concurrency that we are hoping we could improve 
by pushing thread concurrency into individual large segments.

> Divide Segment Search Amongst Multiple Threads
> --
>
> Key: LUCENE-8675
> URL: https://issues.apache.org/jira/browse/LUCENE-8675
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Atri Sharma
>Priority: Major
>
> Segment search is a single threaded operation today, which can be a 
> bottleneck for large analytical queries which index a lot of data and have 
> complex queries which touch multiple segments (imagine a composite query with 
> range query and filters on top). This ticket is for discussing the idea of 
> splitting a single segment into multiple threads based on mutually exclusive 
> document ID ranges.
> This will be a two phase effort, the first phase targeting queries returning 
> all matching documents (collectors not terminating early). The second phase 
> patch will introduce staged execution and will build on top of this patch.
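The mutually exclusive doc ID ranges mentioned in the description could be computed along these lines (a sketch under stated assumptions, not Lucene code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of splitting one segment's doc ID space into mutually exclusive
// contiguous ranges, one per worker thread.
public class DocIdRangeSplitter {

    static class Range {
        final int minDocInclusive;
        final int maxDocExclusive;
        Range(int min, int max) { minDocInclusive = min; maxDocExclusive = max; }
        @Override public String toString() {
            return "[" + minDocInclusive + "," + maxDocExclusive + ")";
        }
    }

    static List<Range> split(int maxDoc, int numThreads) {
        List<Range> ranges = new ArrayList<>();
        int base = maxDoc / numThreads;
        int rem = maxDoc % numThreads;
        int start = 0;
        for (int i = 0; i < numThreads; i++) {
            int size = base + (i < rem ? 1 : 0); // spread the remainder evenly
            ranges.add(new Range(start, start + size));
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 10 docs over 3 threads: ranges [0,4), [4,7), [7,10)
        System.out.println(split(10, 3));
    }
}
```

Each worker would then score only hits whose doc ID falls inside its range; the discussion below covers why this is harder for queries (e.g. range queries over BKD trees) whose cost does not shrink with the doc ID window.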






[jira] [Commented] (SOLR-13190) Fuzzy search treated as server error instead of client error when terms are too complex

2019-02-03 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16759431#comment-16759431
 ] 

Michael McCandless commented on SOLR-13190:
---

+1 to improve the exception message to include the field and fuzzy term that 
led to this.

However, this exception is baffling because the way our FuzzyQuery works is to 
directly produce an already determinized and minimized automaton – that's the 
beauty of the (efficient) Levenshtein automaton construction algorithm.

So why are we then trying to determinize it again?  Something bad is lurking 
here – somehow we lost track that the automaton is already determinized?
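The "already determinized" observation suggests a short-circuit on a determinism flag. A toy sketch of that idea (the Automaton class here is illustrative, not Lucene's):

```java
// Sketch: if construction already yields a deterministic automaton, a
// determinize step should short-circuit on a flag instead of re-running the
// expensive subset construction (and possibly tripping a state-count limit).
public class DeterminizeShortCircuitDemo {

    static class Automaton {
        final boolean deterministic;
        Automaton(boolean deterministic) { this.deterministic = deterministic; }
    }

    static int determinizeCalls = 0;

    static Automaton determinize(Automaton a) {
        if (a.deterministic) {
            return a; // already a DFA: nothing to do
        }
        determinizeCalls++; // the expensive subset construction would run here
        return new Automaton(true);
    }

    public static void main(String[] args) {
        // The Levenshtein automaton construction emits a DFA directly, so the
        // flag should make determinize a no-op.
        Automaton levenshtein = new Automaton(true);
        determinize(levenshtein);
        System.out.println(determinizeCalls);
    }
}
```

The bug hinted at above would be any code path that drops this flag, forcing the redundant (and here, failing) re-determinization.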

> Fuzzy search treated as server error instead of client error when terms are 
> too complex
> ---
>
> Key: SOLR-13190
> URL: https://issues.apache.org/jira/browse/SOLR-13190
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Affects Versions: master (9.0)
>Reporter: Mike Drob
>Assignee: Mike Drob
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We've seen a fuzzy search end up breaking the automaton and getting reported 
> as a server error. This usage should be improved by
> 1) reporting as a client error, because it's similar to something like too 
> many boolean clauses queries in how an operator should deal with it
> 2) report what field is causing the error, since that currently must be 
> deduced from adjacent query logs and can be difficult if there are multiple 
> terms in the search
> This trigger was added to defend against adversarial regex but somehow hits 
> fuzzy terms as well, I don't understand enough about the automaton mechanisms 
> to really know how to approach a fix there, but improving the operability is 
> a good first step.
> relevant stack trace:
> {noformat}
> org.apache.lucene.util.automaton.TooComplexToDeterminizeException: 
> Determinizing automaton with 13632 states and 21348 transitions would result 
> in more than 1 states.
>   at 
> org.apache.lucene.util.automaton.Operations.determinize(Operations.java:746)
>   at 
> org.apache.lucene.util.automaton.RunAutomaton.<init>(RunAutomaton.java:69)
>   at 
> org.apache.lucene.util.automaton.ByteRunAutomaton.<init>(ByteRunAutomaton.java:32)
>   at 
> org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:247)
>   at 
> org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:133)
>   at 
> org.apache.lucene.search.FuzzyTermsEnum.<init>(FuzzyTermsEnum.java:143)
>   at org.apache.lucene.search.FuzzyQuery.getTermsEnum(FuzzyQuery.java:154)
>   at 
> org.apache.lucene.search.MultiTermQuery$RewriteMethod.getTermsEnum(MultiTermQuery.java:78)
>   at 
> org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:58)
>   at 
> org.apache.lucene.search.TopTermsRewrite.rewrite(TopTermsRewrite.java:67)
>   at 
> org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:310)
>   at 
> org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:667)
>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:442)
>   at 
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:200)
>   at 
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1604)
>   at 
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1420)
>   at 
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:567)
>   at 
> org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1435)
>   at 
> org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:374)
>   at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:298)
>   at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:2559)
> {noformat}






[jira] [Commented] (LUCENE-8675) Divide Segment Search Amongst Multiple Threads

2019-02-01 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758451#comment-16758451
 ] 

Michael McCandless commented on LUCENE-8675:


I think it'd be interesting to explore intra-segment parallelism, but I agree 
w/ [~jpountz] that there are challenges :)

If you pass an {{ExecutorService}} to {{IndexSearcher}} today you can already 
use multiple threads to answer one query, but the concurrency is tied to your 
segment geometry and annoyingly a supposedly "optimized" index gets no 
concurrency ;)

But if you do have many segments, this can give a nice reduction to query 
latencies when QPS is well below the searcher's red-line capacity (probably at 
the expense of some hopefully small loss of red-line throughput because of the 
added overhead of thread scheduling).  For certain use cases (large index, low 
typical query rate) this is a powerful approach.

It's true that one can also divide an index into more shards and run each shard 
concurrently but then you are also multiplying the fixed query setup cost which 
in some cases can be relatively significant.
{quote}Parallelizing based on ranges of doc IDs is problematic for some 
queries, for instance the cost of evaluating a range query over an entire 
segment or only about a specific range of doc IDs is exactly the same given 
that it uses data-structures that are organized by value rather than by doc ID.
{quote}
Yeah that's a real problem – these queries traverse the BKD tree per-segment 
while creating the scorer, which is/can be the costly part, and then produce a 
bit set which is very fast to iterate over.  This phase is not separately 
visible to the caller, unlike e.g. rewrite that MultiTermQueries use to 
translate into simpler queries, so it'd be tricky to build intra-segment 
concurrency on top ...

> Divide Segment Search Amongst Multiple Threads
> --
>
> Key: LUCENE-8675
> URL: https://issues.apache.org/jira/browse/LUCENE-8675
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Atri Sharma
>Priority: Major
>
> Segment search is a single threaded operation today, which can be a 
> bottleneck for large analytical queries which index a lot of data and have 
> complex queries which touch multiple segments (imagine a composite query with 
> range query and filters on top). This ticket is for discussing the idea of 
> splitting a single segment into multiple threads based on mutually exclusive 
> document ID ranges.
> This will be a two phase effort, the first phase targeting queries returning 
> all matching documents (collectors not terminating early). The second phase 
> patch will introduce staged execution and will build on top of this patch.






[jira] [Commented] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap

2019-02-01 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758431#comment-16758431
 ] 

Michael McCandless commented on LUCENE-8635:


{quote}Better would be an attribute of {{FieldInfo}}, where we have 
{{put/getAttribute}}. Then {{FieldReader}} can inspect the {{FieldInfo}} and 
pass the appropriate {{On/OffHeapStore}} when creating its {{FST}}. What do you 
think?
{quote}
Hmm that's also an interesting approach to get per-field control.  One can set 
these attributes in a custom {{FieldType}} when indexing documents, or maybe in 
a custom codec at write time (just subclassing e.g. {{Lucene80Codec}}), or at 
read time using a real (named) custom codec.  So we would pick a specific 
string ({{FST_OFF_HEAP}} or something) and define that as a string constant 
which users could then use for setting the attribute?

So ... maybe we have a default behavior w/ Adrien's cool idea, but then also 
allow the attribute to give per-field control?  We should probably also by 
default (if the field attribute is not present) not do off-heap when the 
directory is not MMapDirectory?  We haven't tested the other directory impls 
but I suspect they'd be quite a bit slower with off-heap FST?

 
{quote}Given that reversing the index during write to make it forward reading 
didn't help the performance (in addition to it not being backward compatible), 
is the consensus to add exception for PK and directories other than mmap for 
offheap FST in [^ra.patch]?
{quote}
Yeah +1 to keep the two changes separated.

 

> Lazy loading Lucene FST offheap using mmap
> --
>
> Key: LUCENE-8635
> URL: https://issues.apache.org/jira/browse/LUCENE-8635
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/FSTs
> Environment: I used below setup for es_rally tests:
> single node i3.xlarge running ES 6.5
> es_rally was running on another i3.xlarge instance
>Reporter: Ankit Jain
>Priority: Major
> Attachments: fst-offheap-ra-rev.patch, fst-offheap-rev.patch, 
> offheap.patch, optional_offheap_ra.patch, ra.patch, rally_benchmark.xlsx
>
>
> Currently, FST loads all the terms into heap memory during index open. This 
> causes frequent JVM OOM issues if the term size gets big. A better way of 
> doing this would be to lazily load the FST using mmap. That ensures only the 
> required terms get loaded into memory.
>  
> Lucene can expose an API for providing a list of fields to load terms offheap. I'm 
> planning to take the following approach for this:
>  # Add a boolean property fstOffHeap in FieldInfo
>  # Pass list of offheap fields to lucene during index open (ALL can be 
> special keyword for loading ALL fields offheap)
>  # Initialize the fstOffHeap property during lucene index open
>  # FieldReader invokes default FST constructor or OffHeap constructor based 
> on fstOffHeap field
>  
> I created a patch (that loads all fields offheap), did some benchmarks using 
> es_rally and results look good.






[jira] [Commented] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap

2019-01-29 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755374#comment-16755374
 ] 

Michael McCandless commented on LUCENE-8635:


Oooh I like that proposal [~jpountz]!

> Lazy loading Lucene FST offheap using mmap
> --
>
> Key: LUCENE-8635
> URL: https://issues.apache.org/jira/browse/LUCENE-8635
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/FSTs
> Environment: I used below setup for es_rally tests:
> single node i3.xlarge running ES 6.5
> es_rally was running on another i3.xlarge instance
>Reporter: Ankit Jain
>Priority: Major
> Attachments: fst-offheap-ra-rev.patch, offheap.patch, 
> optional_offheap_ra.patch, ra.patch, rally_benchmark.xlsx
>
>
> Currently, FST loads all the terms into heap memory during index open. This 
> causes frequent JVM OOM issues if the term size gets big. A better way of 
> doing this would be to lazily load the FST using mmap. That ensures only the 
> required terms get loaded into memory.
>  
> Lucene can expose an API for providing a list of fields to load terms offheap. I'm 
> planning to take the following approach for this:
>  # Add a boolean property fstOffHeap in FieldInfo
>  # Pass list of offheap fields to lucene during index open (ALL can be 
> special keyword for loading ALL fields offheap)
>  # Initialize the fstOffHeap property during lucene index open
>  # FieldReader invokes default FST constructor or OffHeap constructor based 
> on fstOffHeap field
>  
> I created a patch (that loads all fields offheap), did some benchmarks using 
> es_rally and results look good.






[jira] [Commented] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap

2019-01-29 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755344#comment-16755344
 ] 

Michael McCandless commented on LUCENE-8635:


OK net/net it looks like there is a small performance impact for some queries, 
and a biggish (-7-8%) impact for {{PKLookup}}.

But this is a nice option to have for users who are heap constrained by the 
FSTs, so I wonder how we could add this option off by default?  E.g. users 
might want their {{id}} field to store the FST in heap (like today), but all 
other fields off-heap.

There is no index format change required here, which is nice, but Lucene 
doesn't make it easy to have read-time codec behavior changes, so maybe the 
solution is that at write-time we add an option e.g. to 
{{BlockTreeTermsWriter}} and it stores this in the index and then at read-time 
{{BlockTreeTermsReader}} checks that option and loads the FST accordingly?  
Then users could customize their codecs to achieve this.

Or I suppose we could add a global system property, e.g. our default stored 
fields writer has a property to turn on/off bulk merge, but I think we are 
trying not to use Java properties going forward?

Can anyone think of any other approaches to make this option possible?

> Lazy loading Lucene FST offheap using mmap
> --
>
> Key: LUCENE-8635
> URL: https://issues.apache.org/jira/browse/LUCENE-8635
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/FSTs
> Environment: I used below setup for es_rally tests:
> single node i3.xlarge running ES 6.5
> es_rally was running on another i3.xlarge instance
>Reporter: Ankit Jain
>Priority: Major
> Attachments: fst-offheap-ra-rev.patch, offheap.patch, 
> optional_offheap_ra.patch, ra.patch, rally_benchmark.xlsx
>
>
> Currently, FST loads all the terms into heap memory during index open. This 
> causes frequent JVM OOM issues if the term size gets big. A better way of 
> doing this would be to lazily load the FST using mmap. That ensures only the 
> required terms get loaded into memory.
>  
> Lucene can expose an API for providing a list of fields to load terms offheap. I'm 
> planning to take the following approach for this:
>  # Add a boolean property fstOffHeap in FieldInfo
>  # Pass list of offheap fields to lucene during index open (ALL can be 
> special keyword for loading ALL fields offheap)
>  # Initialize the fstOffHeap property during lucene index open
>  # FieldReader invokes default FST constructor or OffHeap constructor based 
> on fstOffHeap field
>  
> I created a patch (that loads all fields offheap), did some benchmarks using 
> es_rally and results look good.






[jira] [Commented] (LUCENE-8653) Reverse FST storage so it can be read forward

2019-01-22 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748985#comment-16748985
 ] 

Michael McCandless commented on LUCENE-8653:


Impressive how simple this was!  Reading the {{byte[]}} in forward order is 
simpler to think about, and it ought to be a bit more cache friendly.  I agree 
that jumping between FST nodes is very random access, but e.g. at a given 
node, scanning the arcs looking for a match would become sequential byte 
reads with this change.  Curious that the impact is neutral, but maybe if we 
combine this with LUCENE-8635 we can measure an impact?
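A toy illustration of the byte-order idea (this is not Lucene's actual FST encoding, just the reversal concept): the arcs of a node historically get read back-to-front, and reversing the bytes once at write time turns the same scan into a plain forward, sequential read:

```java
import java.util.Arrays;

// Toy illustration: arcs stored back-to-front, reversed once at build time
// so subsequent lookups scan strictly forward.
public class ForwardReadSketch {

    // Reverse a byte[] once, as proposed at FST write time.
    static byte[] reversed(byte[] src) {
        byte[] out = new byte[src.length];
        for (int i = 0; i < src.length; i++) {
            out[i] = src[src.length - 1 - i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy "node": three arcs, laid out back-to-front as FSTs do today.
        byte[] storedBackwards = {3, 2, 1};
        byte[] forward = reversed(storedBackwards);
        // Scanning the arcs is now a sequential read: [1, 2, 3].
        System.out.println(Arrays.toString(forward));
    }
}
```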

> Reverse FST storage so it can be read forward
> -
>
> Key: LUCENE-8653
> URL: https://issues.apache.org/jira/browse/LUCENE-8653
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/FSTs
>Reporter: Mike Sokolov
>Priority: Major
> Attachments: fst-reverse.patch
>
>
> Discussion of keeping FST off-heap led to the idea of ensuring that FST's can 
> be read forward in order to be more cache-friendly and align better with 
> standard I/O practice. Today FSTs are read in reverse and this leads to some 
> awkwardness, and you can't use standard readers so the code can be confusing 
> to work with.






[jira] [Commented] (LUCENE-8618) MMapDirectory's read ahead on random-access files might trash the OS cache

2019-01-21 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16747963#comment-16747963
 ] 

Michael McCandless commented on LUCENE-8618:


Was the index cold(ish) in this use case?  I.e., was the 2 MB read-ahead 
consuming valuable IO resources that were better spent on the other IOPs 
actually needed for the use case?

> MMapDirectory's read ahead on random-access files might trash the OS cache
> --
>
> Key: LUCENE-8618
> URL: https://issues.apache.org/jira/browse/LUCENE-8618
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> At Elastic, a case was reported to us that runs significantly slower with 
> MMapDirectory than with NIOFSDirectory. After a long analysis, we discovered 
> that it had to do with MMapDirectory's read ahead of 2MB, which doesn't help 
> and even trashes the OS cache on stored fields and term vectors files which 
> have a fully random access pattern (except at merge time).
> The particular use-case that exhibits the slow-down is performing updates, 
> ie. we first look up a document based on its id, fetch stored fields, compute 
> new stored fields (eg. after adding or changing the value of a field) and add 
> the document back to the index. We were able to reproduce the workload that 
> this Elasticsearch user described and measured a median throughput of 3600 
> updates/s with MMapDirectory and 5000 updates/s with NIOFSDirectory. It even 
> goes up to 5600 updates/s if you configure a FileSwitchDirectory to use 
> MMapDirectory for the terms dictionary and NIOFSDirectory for stored fields 
> (postings files are not relevant here since postings are inlined in the terms 
> dict when docFreq=1 and indexOptions=DOCS).
> While it is possible to work around this issue on top of Lucene, maybe this 
> is something that we could improve directly in Lucene, eg. by propagating 
> information about the expected access pattern and avoiding mmap on files that 
> have a fully random access pattern (until Java exposes madvise in some way)?






[jira] [Commented] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap

2019-01-17 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745253#comment-16745253
 ] 

Michael McCandless commented on LUCENE-8635:


OK thanks [~sokolov].  I'll try to also run bench on wikibig and report back.  
I think doing a single method call instead of the two (seek + read) via 
{{RandomAccessInput}} must be helping.
{quote}The thing that makes me want to be careful here is that access to the 
terms index is very random, so things might degrade badly if the OS cache 
doesn't hold the whole terms index in memory.
{quote}
I think net/net we are already relying on the OS to do the right thing here.  As 
things stand today, the OS could also swap out the heap pages that hold the 
FST's {{byte[]}} depending on its swappiness (on Linux). 
{quote}I'm not super familiar with the FST internals, I wonder whether there 
are changes that we could make to it so that it would be more disk-friendly, 
eg. by seeking backward as little as possible when looking up a key?
{quote}
We used to have a {{pack}} method in FST that would 1) try to further 
compress the {{byte[]}} size by moving nodes "closer" to the nodes that 
transitioned to them, and 2) reverse the bytes.  But we removed that method 
because it added complexity, nobody was really using it, and sometimes it 
even made the FST bigger!

Maybe we could bring the method back, but only part 2) of it, and always call 
it at the end of building an FST?  That should be simpler code (without part 
1), and should achieve sequential reads of at least the bytes to decode a 
single transition; maybe it gives a performance jump independent of this 
change?  But I think we really should explore that independently of this issue 
... as long as additional performance tests show only these smallish impacts 
to real queries, we should just make the change across the board for the 
terms dictionary index?

> Lazy loading Lucene FST offheap using mmap
> --
>
> Key: LUCENE-8635
> URL: https://issues.apache.org/jira/browse/LUCENE-8635
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/FSTs
> Environment: I used below setup for es_rally tests:
> single node i3.xlarge running ES 6.5
> es_rally was running on another i3.xlarge instance
>Reporter: Ankit Jain
>Priority: Major
> Attachments: offheap.patch, ra.patch, rally_benchmark.xlsx
>
>






[jira] [Commented] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap

2019-01-16 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16744538#comment-16744538
 ] 

Michael McCandless commented on LUCENE-8635:


Thanks [~sokolov] – those numbers look quite a bit better!  Though, your QPSs 
are kinda high overall – how many Wikipedia docs were in your index?

I do wonder if we simply reversed the FST's byte[] when we create it, what 
impact that'd have on lookup performance.  Hmm even if we did that, we'd still 
have to {{readBytes}} one byte at a time since {{RandomAccessInput}} does not 
have a {{readBytes}} method?  But ... maybe {{IndexInput}} would give good 
performance in that case?  We should probably pursue that separately though...

> Lazy loading Lucene FST offheap using mmap
> --
>
> Key: LUCENE-8635
> URL: https://issues.apache.org/jira/browse/LUCENE-8635
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/FSTs
> Environment: I used below setup for es_rally tests:
> single node i3.xlarge running ES 6.5
> es_rally was running on another i3.xlarge instance
>Reporter: Ankit Jain
>Priority: Major
> Attachments: offheap.patch, ra.patch, rally_benchmark.xlsx
>
>






[jira] [Commented] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap

2019-01-15 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743137#comment-16743137
 ] 

Michael McCandless commented on LUCENE-8635:


Thanks for testing [~sokolov] – the results make sense: the most terms 
dictionary intensive queries are impacted the most, with {{PKLookup}} being 
heavily impacted since that's just purely exercising the terms dictionary with 
no postings visited.  Fuzzy queries, and then queries matching few hits 
(conjunctions with low/medium freq terms) also spend relatively more time in 
the terms dictionary ...

So net/net it looks like we should not make this the default, but expose it 
somehow as an option for those use cases that don't want to dedicate heap 
memory to storing FSTs?

> Lazy loading Lucene FST offheap using mmap
> --
>
> Key: LUCENE-8635
> URL: https://issues.apache.org/jira/browse/LUCENE-8635
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/FSTs
> Environment: I used below setup for es_rally tests:
> single node i3.xlarge running ES 6.5
> es_rally was running on another i3.xlarge instance
>Reporter: Ankit Jain
>Priority: Major
> Attachments: offheap.patch, rally_benchmark.xlsx
>
>






[jira] [Commented] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap

2019-01-11 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740764#comment-16740764
 ] 

Michael McCandless commented on LUCENE-8635:


Also, have you confirmed that all tests pass when you always use off-heap FST 
storage?

> Lazy loading Lucene FST offheap using mmap
> --
>
> Key: LUCENE-8635
> URL: https://issues.apache.org/jira/browse/LUCENE-8635
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/FSTs
> Environment: I used below setup for es_rally tests:
> single node i3.xlarge running ES 6.5
> es_rally was running on another i3.xlarge instance
>Reporter: Ankit Jain
>Priority: Major
> Attachments: offheap.patch, rally_benchmark.xlsx
>
>






[jira] [Commented] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap

2019-01-11 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740757#comment-16740757
 ] 

Michael McCandless commented on LUCENE-8635:


Wow, this is impressive!  Surprising how small the change was – basically 
opening up the FST BytesStore API a bit so that we could have an impl that 
wraps an {{IndexInput}} (reading backwards) instead of a {{byte[]}}.

Can you copy/paste the rally results out of Excel here?  I'm curious what 
search-time impact you're seeing.  If it's not too much of an impact, maybe we 
should consider just moving FSTs off-heap in the default codec?  We've done 
similar things recently in Lucene ... e.g. moving norms off heap.

I'll run Lucene's wikipedia benchmarks to measure the impact from our standard 
benchmarks (the nightly Lucene benchmarks).

> Lazy loading Lucene FST offheap using mmap
> --
>
> Key: LUCENE-8635
> URL: https://issues.apache.org/jira/browse/LUCENE-8635
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/FSTs
> Environment: I used below setup for es_rally tests:
> single node i3.xlarge running ES 6.5
> es_rally was running on another i3.xlarge instance
>Reporter: Ankit Jain
>Priority: Major
> Attachments: offheap.patch, rally_benchmark.xlsx
>
>






[jira] [Commented] (LUCENE-8601) Adding attributes to IndexFieldType

2019-01-04 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734103#comment-16734103
 ] 

Michael McCandless commented on LUCENE-8601:


Thanks [~muralikpbhat] – I'll review and push to 7.x!

> Adding attributes to IndexFieldType
> ---
>
> Key: LUCENE-8601
> URL: https://issues.apache.org/jira/browse/LUCENE-8601
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 7.5
>Reporter: Murali Krishna P
>Priority: Major
> Attachments: 7x_LUCENE-8601.06.patch, LUCENE-8601.01.patch, 
> LUCENE-8601.02.patch, LUCENE-8601.03.patch, LUCENE-8601.04.patch, 
> LUCENE-8601.05.patch, LUCENE-8601.06.patch, LUCENE-8601.patch
>
>
> Today, we can write a custom Field using a custom IndexFieldType, but when 
> the DefaultIndexingChain converts [IndexFieldType to 
> FieldInfo|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L662],
>  only a few key pieces of information, such as index options and doc-values 
> type, are retained. The [Codec gets the 
> FieldInfo|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/DocValuesConsumer.java#L90],
>  but not the type details.
>   
>  FieldInfo already supports ['attributes'| 
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/FieldInfo.java#L47]
>  and it would be great if we could add 'attributes' to IndexFieldType as 
> well and copy them to FieldInfo's 'attributes'.
>   
>  This would allow someone to write a custom codec (extending a doc-values 
> format, for example) for only the 'special field' they want, and delegate 
> the rest of the fields to the default codec.
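The attribute copy the description asks for can be sketched in isolation. This is a hypothetical, simplified model using plain maps; `copyAttributes` and its last-writer-wins behavior are illustrative assumptions, not Lucene's actual indexing chain:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: copying a field type's attributes into the per-field
// info the codec sees; names are illustrative, not Lucene's actual API.
public class AttributeCopySketch {

    // Stands in for putAttribute-style semantics: the field type's value
    // overwrites any existing value for the same key (last writer wins).
    static Map<String, String> copyAttributes(Map<String, String> fieldTypeAttrs,
                                              Map<String, String> fieldInfoAttrs) {
        Map<String, String> merged = new HashMap<>(fieldInfoAttrs);
        merged.putAll(fieldTypeAttrs);
        return merged;
    }

    public static void main(String[] args) {
        // A custom doc-values format name carried as an attribute, so a
        // per-field codec could dispatch on it.
        Map<String, String> typeAttrs = Map.of("dv-format", "MyDocValuesFormat");
        System.out.println(copyAttributes(typeAttrs, Map.of()));
    }
}
```

Note the last-writer-wins choice is exactly why conflicting attributes across documents have undefined results, as discussed later in the thread.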






[jira] [Commented] (LUCENE-8601) Adding attributes to IndexFieldType

2019-01-03 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733641#comment-16733641
 ] 

Michael McCandless commented on LUCENE-8601:


Hi [~muralikpbhat], I pushed the change to master, thanks!

But the {{git cherry-pick}} back to 7.x was not clean – could you fix up the 
patch to apply to 7.x as well?  Also, the test case uses a FieldInfos API that 
was never back-ported to 7.x ({{getMergedFieldInfos}}).

Also, staring at the code shortly after I pushed, I noticed that the field 
type's attributes will be saved into FieldInfo the first time that field is 
seen for a given segment, but on subsequent times it looks like we will fail to 
copy the attributes again.  Can you also add a test case exposing this bug, and 
then fix it?  We can do that on a follow-on issue ... thanks!

> Adding attributes to IndexFieldType
> ---
>
> Key: LUCENE-8601
> URL: https://issues.apache.org/jira/browse/LUCENE-8601
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 7.5
>Reporter: Murali Krishna P
>Priority: Major
> Attachments: LUCENE-8601.01.patch, LUCENE-8601.02.patch, 
> LUCENE-8601.03.patch, LUCENE-8601.04.patch, LUCENE-8601.05.patch, 
> LUCENE-8601.06.patch, LUCENE-8601.patch
>
>






[jira] [Commented] (LUCENE-8621) Move LatLonShape out of sandbox

2019-01-01 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731697#comment-16731697
 ] 

Michael McCandless commented on LUCENE-8621:


+1

> Move LatLonShape out of sandbox
> ---
>
> Key: LUCENE-8621
> URL: https://issues.apache.org/jira/browse/LUCENE-8621
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> LatLonShape has matured a lot over the last months, so I'd like to start 
> thinking about moving it out of the sandbox so that it doesn't stay there for 
> too long, like what happened to LatLonPoint. I am pretty happy with the 
> current encoding. To my knowledge, we might just need a minor modification 
> because of LUCENE-8620.






[jira] [Commented] (LUCENE-8601) Adding attributes to IndexFieldType

2019-01-01 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731696#comment-16731696
 ] 

Michael McCandless commented on LUCENE-8601:


Thanks I will review and push soon!

> Adding attributes to IndexFieldType
> ---
>
> Key: LUCENE-8601
> URL: https://issues.apache.org/jira/browse/LUCENE-8601
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 7.5
>Reporter: Murali Krishna P
>Priority: Major
> Attachments: LUCENE-8601.01.patch, LUCENE-8601.02.patch, 
> LUCENE-8601.03.patch, LUCENE-8601.04.patch, LUCENE-8601.05.patch, 
> LUCENE-8601.06.patch, LUCENE-8601.patch
>
>






[jira] [Commented] (LUCENE-8601) Adding attributes to IndexFieldType

2018-12-31 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731288#comment-16731288
 ] 

Michael McCandless commented on LUCENE-8601:


Ahh OK thanks [~muralikpbhat]; that makes sense, so let's leave the assertion 
out.

> Adding attributes to IndexFieldType
> ---
>
> Key: LUCENE-8601
> URL: https://issues.apache.org/jira/browse/LUCENE-8601
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 7.5
>Reporter: Murali Krishna P
>Priority: Major
> Attachments: LUCENE-8601.01.patch, LUCENE-8601.02.patch, 
> LUCENE-8601.03.patch, LUCENE-8601.04.patch, LUCENE-8601.05.patch, 
> LUCENE-8601.patch
>
>






[jira] [Commented] (LUCENE-8601) Adding attributes to IndexFieldType

2018-12-28 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16730210#comment-16730210
 ] 

Michael McCandless commented on LUCENE-8601:


Thanks [~muralikpbhat]!

Maybe add javadocs about {{ignoreCurrentFormat}} parameter?

Can you use multi-line {{if}} statement instead of ternary operator?

I think the changes to {{PerFieldPostingsFormat}} are OK, except: instead of 
removing the check that there was no format there and blindly overwriting it, 
can you change it to check that either the format wasn't there (what it checks 
now) or, if it is there, that the attribute values match what that postings 
format wants to write?

No need to initialize class members with {{= null}}; that's already the default 
in Java.

In {{DefaultIndexingChain}} can you use a local variable for the 
{{fieldType.getAttributes()}} in the two places where you reference it?

> Adding attributes to IndexFieldType
> ---
>
> Key: LUCENE-8601
> URL: https://issues.apache.org/jira/browse/LUCENE-8601
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 7.5
>Reporter: Murali Krishna P
>Priority: Major
> Attachments: LUCENE-8601.01.patch, LUCENE-8601.02.patch, 
> LUCENE-8601.03.patch, LUCENE-8601.04.patch, LUCENE-8601.05.patch, 
> LUCENE-8601.patch
>
>






[jira] [Commented] (LUCENE-8601) Adding attributes to IndexFieldType

2018-12-19 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725297#comment-16725297
 ] 

Michael McCandless commented on LUCENE-8601:


Hmm, I'm concerned that segment merging may not preserve the attributes.

[~muralikpbhat] could you please add a test case that forces merging?  E.g. 
index one document with attributes, commit (so it writes a segment), index 
another without attributes, commit, and confirm attributes survived?

Can you also update the javadocs to state that if you try to index conflicting 
attributes the behavior is undefined (i.e. which attribute wins is undefined).

> Adding attributes to IndexFieldType
> ---
>
> Key: LUCENE-8601
> URL: https://issues.apache.org/jira/browse/LUCENE-8601
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 7.5
>Reporter: Murali Krishna P
>Priority: Major
> Attachments: LUCENE-8601.01.patch, LUCENE-8601.02.patch, 
> LUCENE-8601.03.patch, LUCENE-8601.04.patch, LUCENE-8601.patch
>
>





