[jira] [Commented] (LUCENE-9815) PerField formats can select the format based on FieldInfo
[ https://issues.apache.org/jira/browse/LUCENE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292768#comment-17292768 ]

Bruno Roustant commented on LUCENE-9815:
----------------------------------------

+1 on LUCENE-9796
I'll close this PR and try to find some cycles to help on LUCENE-9796.

> PerField formats can select the format based on FieldInfo
> ---------------------------------------------------------
>
>                 Key: LUCENE-9815
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9815
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Bruno Roustant
>            Priority: Minor
>         Attachments: Screen_Shot_2021-02-28_at_16.08.05.png
>
>
> PerFieldDocValuesFormat and PerFieldPostingsFormat currently only select the
> format based on the field name.
> If we improve them to also support the selection based on the FieldInfo, it
> will be possible to select based on some FieldInfo attribute, DocValuesType,
> etc.
> +Example use-case:+
> It will be possible to adapt the compression mode of doc values fields
> easily based on the DocValuesType. E.g. compressing sorted and not binary doc
> values.
>
> User creates a new custom codec which provides a custom DocValuesFormat
> which extends PerFieldDocValuesFormat and implements the method
> DocValuesFormat getDocValuesFormatForField(FieldInfo fieldInfo).
> This method provides either a standard Lucene80DocValuesFormat (no
> compression) or another new custom DocValuesFormat extending
> Lucene80DocValuesFormat with BEST_COMPRESSION mode.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
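For readers following the thread, the proposal in the issue description can be pictured roughly as below. This is only a sketch under stated assumptions, not the patch from the PR: every type here is a small stand-in that borrows a name from the issue text so the example compiles without Lucene, the default delegation from the FieldInfo hook to the name-based hook is an assumption, and the two format constants are hypothetical labels rather than real format instances. Only the doc values type of a field is modeled.

```java
// Stand-in types so the sketch compiles without Lucene on the classpath.
// They borrow names from the issue text but are NOT the real Lucene classes.
enum DocValuesType { NONE, NUMERIC, BINARY, SORTED, SORTED_NUMERIC, SORTED_SET }

final class FieldInfo {
    final String name;
    final DocValuesType docValuesType;

    FieldInfo(String name, DocValuesType docValuesType) {
        this.name = name;
        this.docValuesType = docValuesType;
    }
}

/** Stand-in for a doc values format; it only carries a label for the demo. */
final class DocValuesFormat {
    final String label;

    DocValuesFormat(String label) {
        this.label = label;
    }
}

abstract class PerFieldDocValuesFormatSketch {
    /** Name-based hook: the field name is the only input. */
    abstract DocValuesFormat getDocValuesFormatForField(String field);

    /**
     * FieldInfo-based hook as described in the issue. Falling back to the
     * name-based choice by default is an assumption of this sketch.
     */
    DocValuesFormat getDocValuesFormatForField(FieldInfo fieldInfo) {
        return getDocValuesFormatForField(fieldInfo.name);
    }
}

/** Hypothetical subclass: compress sorted doc values, leave everything else alone. */
final class CompressSortedOnly extends PerFieldDocValuesFormatSketch {
    static final DocValuesFormat DEFAULT = new DocValuesFormat("default (no compression)");
    static final DocValuesFormat COMPRESSED = new DocValuesFormat("BEST_COMPRESSION");

    @Override
    DocValuesFormat getDocValuesFormatForField(String field) {
        // The name alone says nothing about the doc values type.
        return DEFAULT;
    }

    @Override
    DocValuesFormat getDocValuesFormatForField(FieldInfo fieldInfo) {
        switch (fieldInfo.docValuesType) {
            case SORTED:
            case SORTED_SET:
                return COMPRESSED;
            default:
                return DEFAULT;
        }
    }
}
```

With these stand-ins, a SORTED field named "category" would get the compressed format and a BINARY field named "payload" the default one, without either field name being listed anywhere.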
[jira] [Commented] (LUCENE-9815) PerField formats can select the format based on FieldInfo
[ https://issues.apache.org/jira/browse/LUCENE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292522#comment-17292522 ]

Robert Muir commented on LUCENE-9815:
-------------------------------------

And see the LUCENE-9796 issue, which is there to illuminate the broken
algorithms using SORTED in the wrong way so that we can fix them one by one.

I already fixed CheckIndex in LUCENE-9795. As you can see, you just fix the
inefficient consumer code and there is no perf issue:

!Screen_Shot_2021-02-28_at_16.08.05.png!

So let's fix that stuff to make things fast, and so that we can always
improve the compression. We don't need abstractions or options.
[jira] [Commented] (LUCENE-9815) PerField formats can select the format based on FieldInfo
[ https://issues.apache.org/jira/browse/LUCENE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292520#comment-17292520 ]

Robert Muir commented on LUCENE-9815:
-------------------------------------

Sorry, that comparison is apples and oranges. Any option should only apply
to BINARY, which is for the user to stuff whatever they want in there.

Sorted is not like that: you asked for the field to be deduplicated and
dereferenced by ordinals, which means you want ordinals. The algorithms
should use the per-doc ordinals, and then it doesn't hurt if we compress the
terms dict a little bit more than we did before. Again: we always compressed
it!
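To make the ordinals point concrete, here is a toy comparison of the two access patterns. It is a sketch, not Lucene code: `SortedColumnSketch` and `OrdinalCountingSketch` are hypothetical stand-ins, the method names `ordValue` and `lookupOrd` are only modeled on the sorted doc values API, and the `lookups` counter stands in for the cost of reading (and possibly decompressing) the terms dictionary.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for a SORTED field: a sorted, deduplicated dictionary plus one ordinal per document. */
final class SortedColumnSketch {
    private final String[] dictionary;
    private final int[] docToOrd;
    int lookups = 0; // counts dictionary reads, the step that would pay for decompression

    SortedColumnSketch(String[] dictionary, int[] docToOrd) {
        this.dictionary = dictionary;
        this.docToOrd = docToOrd;
    }

    int maxDoc() {
        return docToOrd.length;
    }

    int valueCount() {
        return dictionary.length;
    }

    int ordValue(int doc) {
        return docToOrd[doc];
    }

    String lookupOrd(int ord) {
        lookups++;
        return dictionary[ord];
    }
}

final class OrdinalCountingSketch {
    /** The pattern criticized in the thread: resolve the term for every document and hash on it. */
    static Map<String, Integer> countByTerm(SortedColumnSketch col) {
        Map<String, Integer> counts = new HashMap<>();
        for (int doc = 0; doc < col.maxDoc(); doc++) {
            counts.merge(col.lookupOrd(col.ordValue(doc)), 1, Integer::sum);
        }
        return counts;
    }

    /** Ordinal-based pattern: count into an array indexed by ordinal, never touching the dictionary. */
    static int[] countByOrdinal(SortedColumnSketch col) {
        int[] counts = new int[col.valueCount()];
        for (int doc = 0; doc < col.maxDoc(); doc++) {
            counts[col.ordValue(doc)]++;
        }
        return counts;
    }
}
```

In the ordinal version the dictionary is consulted only when a result has to be shown, so the number of lookups depends on how many values are reported rather than on how many documents are visited.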
[jira] [Commented] (LUCENE-9815) PerField formats can select the format based on FieldInfo
[ https://issues.apache.org/jira/browse/LUCENE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292518#comment-17292518 ]

Bruno Roustant commented on LUCENE-9815:
----------------------------------------

[~rcmuir] do you mean always compressing sorted docvalues and having an
on/off mode for binary docvalues?

Based on LUCENE-9378, binary docvalues compression causes a big perf impact.
So the current on/off compression mode, which covers all docvalues, is not
very useful: users do not want to take the perf hit for binary docvalues, so
they don't enable compression for sorted set either.
[jira] [Commented] (LUCENE-9815) PerField formats can select the format based on FieldInfo
[ https://issues.apache.org/jira/browse/LUCENE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292426#comment-17292426 ]

Robert Muir commented on LUCENE-9815:
-------------------------------------

Instead of adding more complexity/abstractions, I would like to see a PR
that just removes the compression option completely for Sorted term
dictionaries.

Of course we compress the sorted term dictionaries. We were always
compressing them before too (with delta encoding etc). Now they just use a
little LZ4 as well.

But there shouldn't be a config option around it, and then we don't need
sophisticated per-field stuff either.
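As background for the "delta encoding" remark, the general idea of prefix-delta encoding a sorted term list can be sketched as follows. This illustrates the technique only and makes no claim about Lucene's actual on-disk layout; `PrefixDeltaSketch` and its `Entry` type are hypothetical names, and the sketch works on Java strings rather than term bytes.

```java
import java.util.ArrayList;
import java.util.List;

final class PrefixDeltaSketch {
    /** One encoded term: how many leading chars it shares with the previous term, plus the rest. */
    static final class Entry {
        final int sharedPrefix;
        final String suffix;

        Entry(int sharedPrefix, String suffix) {
            this.sharedPrefix = sharedPrefix;
            this.suffix = suffix;
        }
    }

    /** Encodes terms in order; neighbors in a sorted list tend to share long prefixes. */
    static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> out = new ArrayList<>();
        String previous = "";
        for (String term : sortedTerms) {
            int limit = Math.min(previous.length(), term.length());
            int shared = 0;
            while (shared < limit && previous.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            out.add(new Entry(shared, term.substring(shared)));
            previous = term;
        }
        return out;
    }

    /** Rebuilds each term from the previous one, so decoding is sequential. */
    static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String previous = "";
        for (Entry entry : entries) {
            String term = previous.substring(0, entry.sharedPrefix) + entry.suffix;
            out.add(term);
            previous = term;
        }
        return out;
    }
}
```

Because every term is rebuilt from its predecessor, resolving a single term costs more than reading an ordinal, which is consistent with the argument in this thread that consumers should work on ordinals and look terms up only when needed.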
[jira] [Commented] (LUCENE-9815) PerField formats can select the format based on FieldInfo
[ https://issues.apache.org/jira/browse/LUCENE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292425#comment-17292425 ]

Robert Muir commented on LUCENE-9815:
-------------------------------------

I still think the use-case/premise here is wrong. We shouldn't even have it
as an option to compress the sorted docvalues term dictionary, it should
just be what happens. And any bad code hashing on term bytes and stuff
instead of using ordinals should just be fixed.