[jira] [Commented] (LUCENE-8041) All Fields.terms(fld) impls should be O(1) not O(log(N))

2019-12-03 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986764#comment-16986764
 ] 

Bruno Roustant commented on LUCENE-8041:


I created LUCENE-9078 "Term vectors options should not be configurable per-doc".

[~dsmiley] should we close this one?

> All Fields.terms(fld) impls should be O(1) not O(log(N))
> 
>
> Key: LUCENE-8041
> URL: https://issues.apache.org/jira/browse/LUCENE-8041
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: David Smiley
> Priority: Major
> Attachments: LUCENE-8041-LinkedHashMap.patch, LUCENE-8041.patch
>
>
> I've seen apps that have a good number of fields -- hundreds.  The O(log(N)) 
> of TreeMap definitely shows up in a profiler; sometimes 20% of search time, 
> if I recall.  There are many Fields implementations that are impacted... in 
> part because Fields is the base class of FieldsProducer.  
> As an aside, I hope Fields goes away some day; FieldsProducer should be 
> TermsProducer and not have an iterator of fields. If DocValuesProducer 
> doesn't have this then why should the terms index part of our API have it?  
> If we did this then the issue here would be a simple transition to a HashMap.
> Or maybe we can switch to HashMap and relax the definition of Fields.iterator 
> to not necessarily be sorted?
> Perhaps the fix can be a relatively simple conversion over to LinkedHashMap 
> in many cases if we can assume when we initialize these internal maps that we 
> consume them in sorted order to begin with.
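
A minimal sketch of the LinkedHashMap idea in the description above, assuming the field names really do arrive in sorted order when the map is initialized. LinkedFieldMapSketch and fromSortedFields are hypothetical names used for illustration, not Lucene APIs, and the sorted-input check is an added safeguard rather than something the issue proposes.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; the names here are not Lucene APIs. */
final class LinkedFieldMapSketch {

  /**
   * Copies entries into a LinkedHashMap. The caller must supply them in
   * strictly ascending field-name order; that is the assumption under test.
   */
  static <T> Map<String, T> fromSortedFields(List<Map.Entry<String, T>> sortedEntries) {
    Map<String, T> byField = new LinkedHashMap<>();
    String previous = null;
    for (Map.Entry<String, T> e : sortedEntries) {
      // Fail fast if the "already sorted" assumption does not hold.
      if (previous != null && previous.compareTo(e.getKey()) >= 0) {
        throw new IllegalArgumentException(
            "fields not sorted: " + previous + " >= " + e.getKey());
      }
      byField.put(e.getKey(), e.getValue());
      previous = e.getKey();
    }
    // get(field) is an expected O(1) hash lookup; keySet() iterates in
    // insertion order, which equals sorted order given the check above.
    return byField;
  }
}
```

If the input order cannot be guaranteed, this approach does not apply and the names would have to be sorted explicitly instead.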



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8041) All Fields.terms(fld) impls should be O(1) not O(log(N))

2019-12-02 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986153#comment-16986153
 ] 

Bruno Roustant commented on LUCENE-8041:


Sorry, I created and worked on what is effectively a duplicate Jira issue, 
LUCENE-9045 (now linked to this one as a child). I only just learned of this one.

LUCENE-9045 fixed the problem for BlockTree and PerFieldPostingsFormat only.

I read in the thread that we should work on making term vectors consistent 
across the index. Should I create another Jira issue specific to that (and 
close this one as a duplicate)? Or should I keep this one and maybe rename it?




[jira] [Commented] (LUCENE-8041) All Fields.terms(fld) impls should be O(1) not O(log(N))

2019-10-17 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954040#comment-16954040
 ] 

David Wayne Smiley commented on LUCENE-8041:


My recommendation is not to use a TreeMap at all.  Use a plain HashMap.  In the 
constructor, save away an Iterable<String> of the sorted field names using a 
method such as the following (taken from a fork of Lucene at work):
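{code:java}
// Requires java.util.ArrayList, Collection, Collections and List.
// Copies the field names and sorts them once, so iterator() can return them in order.
private static Iterable<String> sortedFieldNames(Collection<String> unsortedFields) {
  List<String> fieldsNames = new ArrayList<>(unsortedFields);
  Collections.sort(fieldsNames);
  return Collections.unmodifiableCollection(fieldsNames);
}
{code}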

You could just do this for PerFieldPostingsFormat's reader and 
Lucene50PostingsReader as these are the common ones, or extend this idea to 
others if you wish.  UniformSplit already takes this approach.  Maybe just 
do those 2 up front for code review.  You could put these few lines of code 
into the affected files, or consider adding this as a protected method on 
Fields.

One day we can get rid of iterator() and that day it'll be less change if we 
use a HashMap for the fields, which is what we'll want then.
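
The approach above might be sketched as follows. This is an illustration under assumptions, not the actual patch: HashedFieldsSketch is a hypothetical stand-in for a Fields implementation, and the type parameter T stands in for Lucene's Terms so that the example compiles without Lucene on the classpath.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a Fields implementation; not Lucene's actual class. */
final class HashedFieldsSketch<T> implements Iterable<String> {
  // Expected O(1) lookup by field name, replacing a TreeMap's O(log(N)).
  private final Map<String, T> termsByField;
  // Sorted once in the constructor, so iteration order is still sorted.
  private final Iterable<String> sortedFieldNames;

  HashedFieldsSketch(Map<String, T> fields) {
    this.termsByField = new HashMap<>(fields);
    this.sortedFieldNames = sortedFieldNames(termsByField.keySet());
  }

  private static Iterable<String> sortedFieldNames(Collection<String> unsortedFields) {
    List<String> fieldNames = new ArrayList<>(unsortedFields);
    Collections.sort(fieldNames);
    return Collections.unmodifiableCollection(fieldNames);
  }

  /** Plays the role of Fields.terms(field); returns null for an unknown field. */
  T terms(String field) {
    return termsByField.get(field);
  }

  /** Plays the role of Fields.iterator(); yields field names in sorted order. */
  @Override
  public Iterator<String> iterator() {
    return sortedFieldNames.iterator();
  }
}
```

The one-time sort costs O(N log(N)) at construction, which is paid once per reader rather than on every terms(field) call.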




[jira] [Commented] (LUCENE-8041) All Fields.terms(fld) impls should be O(1) not O(log(N))

2019-10-15 Thread Huy Le (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952483#comment-16952483
 ] 

Huy Le commented on LUCENE-8041:


[~dsmiley] is there any way we can move forward with this ticket?
