[jira] [Commented] (LUCENE-9815) PerField formats can select the format based on FieldInfo

2021-03-01 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292768#comment-17292768
 ] 

Bruno Roustant commented on LUCENE-9815:


+1 on LUCENE-9796
I'll close this PR and try to find some cycles to help on LUCENE-9796.

> PerField formats can select the format based on FieldInfo
> -
>
> Key: LUCENE-9815
> URL: https://issues.apache.org/jira/browse/LUCENE-9815
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Minor
> Attachments: Screen_Shot_2021-02-28_at_16.08.05.png
>
>
> PerFieldDocValuesFormat and PerFieldPostingsFormat currently only select the 
> format based on the field name.
> If we improve them to also support the selection based on the FieldInfo, it 
> will be possible to select based on some FieldInfo attribute, DocValuesType, 
> etc.
> +Example use-case:+
>  It will be possible to adapt the compression mode of doc values fields 
> easily based on the DocValuesType, e.g. compressing sorted but not binary doc 
> values.
>  The user creates a new custom codec which provides a custom DocValuesFormat 
> which extends PerFieldDocValuesFormat and implements the method
>  DocValuesFormat getDocValuesFormatForField(FieldInfo fieldInfo).
>  This method provides either a standard Lucene80DocValuesFormat (no 
> compression) or another new custom DocValuesFormat extending 
> Lucene80DocValuesFormat with BEST_COMPRESSION mode.
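
The selection described in the use-case above could look roughly like the following. This is only a sketch with stand-in types so that it compiles without Lucene: the nested DocValuesType, FieldInfo and Mode types and the modeForField method are hypothetical placeholders, not Lucene classes. A real implementation would return a DocValuesFormat from the proposed getDocValuesFormatForField(FieldInfo) method instead of a mode.

```java
// Sketch only: stand-in types, not the Lucene API.
class PerFieldSelectionSketch {

    // Hypothetical stand-in for Lucene's DocValuesType enum.
    enum DocValuesType { NONE, NUMERIC, BINARY, SORTED, SORTED_NUMERIC, SORTED_SET }

    // Hypothetical stand-in for the compression mode of a doc values format.
    enum Mode { BEST_SPEED, BEST_COMPRESSION }

    // Hypothetical stand-in for FieldInfo, reduced to what the selection needs.
    static final class FieldInfo {
        final String name;
        final DocValuesType docValuesType;

        FieldInfo(String name, DocValuesType docValuesType) {
            this.name = name;
            this.docValuesType = docValuesType;
        }
    }

    // Select per field from the FieldInfo rather than from the field name:
    // compress sorted doc values, leave everything else (binary included) alone.
    static Mode modeForField(FieldInfo fieldInfo) {
        switch (fieldInfo.docValuesType) {
            case SORTED:
            case SORTED_SET:
                return Mode.BEST_COMPRESSION;
            default:
                return Mode.BEST_SPEED;
        }
    }

    public static void main(String[] args) {
        FieldInfo category = new FieldInfo("category", DocValuesType.SORTED);
        FieldInfo payload = new FieldInfo("payload", DocValuesType.BINARY);
        System.out.println(category.name + " -> " + modeForField(category));
        System.out.println(payload.name + " -> " + modeForField(payload));
    }
}
```

The point of the sketch is that the decision depends only on the doc values type, so no per-field-name configuration is needed.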



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9815) PerField formats can select the format based on FieldInfo

2021-02-28 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292522#comment-17292522
 ] 

Robert Muir commented on LUCENE-9815:
-

Also see LUCENE-9796, which sheds light on the broken algorithms that use SORTED 
the wrong way, so that we can fix them one by one. I already fixed CheckIndex 
in LUCENE-9795. As you can see, once you fix the inefficient consumer code 
there is no perf issue:  !Screen_Shot_2021-02-28_at_16.08.05.png! 

So let's fix that stuff to make things fast, and so that we can always improve 
the compression. We don't need abstractions or options.




[jira] [Commented] (LUCENE-9815) PerField formats can select the format based on FieldInfo

2021-02-28 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292520#comment-17292520
 ] 

Robert Muir commented on LUCENE-9815:
-

Sorry, that comparison is apples and oranges. Any option should apply only to 
BINARY, which exists for users to stuff whatever they want in there.

SORTED is not like that: you asked for the field to be deduplicated and 
dereferenced by ordinals, which means you want ordinals. The algorithms should 
use the per-doc ordinals, and then it doesn't hurt if we compress the terms dict 
a little bit more than we did before. Again: we always compressed it!
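
To illustrate the point about ordinals, here is a small sketch (not code from the issue or from Lucene). It models a SORTED field as a deduplicated term dictionary plus one ordinal per document; the names TERM_DICT, DOC_ORDS and lookupOrd are hypothetical, and lookupOrd merely stands in for a term dictionary lookup that may have to decompress a block.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a toy model of SORTED doc values, not the Lucene API.
class OrdinalCountingSketch {

    // Hypothetical sorted, deduplicated term dictionary.
    static final String[] TERM_DICT = {"apple", "banana", "cherry"};
    // Hypothetical per-document ordinals pointing into TERM_DICT.
    static final int[] DOC_ORDS = {0, 2, 2, 1, 2, 0};

    // Counts how often the (potentially expensive) dictionary lookup runs.
    static int lookups = 0;

    static String lookupOrd(int ord) {
        lookups++;
        return TERM_DICT[ord];
    }

    // Inefficient consumer: resolves the term for every document and hashes on it.
    static Map<String, Integer> countByTerm() {
        Map<String, Integer> counts = new HashMap<>();
        for (int ord : DOC_ORDS) {
            counts.merge(lookupOrd(ord), 1, Integer::sum);
        }
        return counts;
    }

    // Ordinal-based consumer: counts in an int array, resolves each term once at the end.
    static Map<String, Integer> countByOrd() {
        int[] ordCounts = new int[TERM_DICT.length];
        for (int ord : DOC_ORDS) {
            ordCounts[ord]++;
        }
        Map<String, Integer> counts = new HashMap<>();
        for (int ord = 0; ord < ordCounts.length; ord++) {
            if (ordCounts[ord] > 0) {
                counts.put(lookupOrd(ord), ordCounts[ord]);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        lookups = 0;
        Map<String, Integer> byTerm = countByTerm();
        int termLookups = lookups;

        lookups = 0;
        Map<String, Integer> byOrd = countByOrd();
        int ordLookups = lookups;

        System.out.println(byTerm.equals(byOrd));            // prints "true"
        System.out.println(termLookups + " vs " + ordLookups); // prints "6 vs 3"
    }
}
```

Both consumers produce the same counts, but the ordinal-based one touches the term dictionary once per distinct value rather than once per document, so a more heavily compressed dictionary costs it much less.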




[jira] [Commented] (LUCENE-9815) PerField formats can select the format based on FieldInfo

2021-02-28 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292518#comment-17292518
 ] 

Bruno Roustant commented on LUCENE-9815:


[~rcmuir] do you mean always compressing sorted docvalues and having an on/off 
mode for binary docvalues?
Based on LUCENE-9378, binary docvalues compression has a big perf impact, so 
the current on/off compression mode for all docvalues is not very useful: 
users do not want to take the perf hit for binary docvalues, so they don't 
enable compression for sorted set either.




[jira] [Commented] (LUCENE-9815) PerField formats can select the format based on FieldInfo

2021-02-28 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292426#comment-17292426
 ] 

Robert Muir commented on LUCENE-9815:
-

Instead of adding more complexity/abstractions, I would like to see a PR that 
just removes the compression option completely for Sorted term dictionaries.

Of course we compress the sorted term dictionaries. We were always compressing 
them before too (with delta encoding etc.). Now they just use a little LZ4 as 
well. But there shouldn't be a config option around it, and then we don't need 
sophisticated per-field stuff either.




[jira] [Commented] (LUCENE-9815) PerField formats can select the format based on FieldInfo

2021-02-28 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292425#comment-17292425
 ] 

Robert Muir commented on LUCENE-9815:
-

I still think the use-case/premise here is wrong. Compressing the sorted 
docvalues term dictionary shouldn't even be an option; it should just be 
what happens.

And any bad code that hashes on term bytes instead of using ordinals 
should just be fixed.
