[jira] [Commented] (LUCENE-10054) Handle hierarchy in HNSW graph

2022-07-26 Thread Mike Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571588#comment-17571588
 ] 

Mike Sokolov commented on LUCENE-10054:
---

What is it about this issue that spammers love so much!? I wonder if we
could somehow lock it as read-only ...



> Handle hierarchy in HNSW graph
> --
>
> Key: LUCENE-10054
> URL: https://issues.apache.org/jira/browse/LUCENE-10054
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Mayya Sharipova
> Priority: Major
> Labels: vector-based-search
> Fix For: 9.1
>
> Time Spent: 20h 20m
> Remaining Estimate: 0h
>
> Currently the HNSW graph is represented as a single-layer graph.
> We would like to extend it to handle hierarchy as per
> [discussion|https://issues.apache.org/jira/browse/LUCENE-9004?focusedCommentId=17393216=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17393216].
>
> TODO tasks:
> - add multiple layers to the HnswGraph class
> - modify the format in Lucene90HnswVectorsWriter and
>   Lucene90HnswVectorsReader to handle multiple layers
> - modify the graph construction and search algorithms to handle hierarchy
> - run benchmarks
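
As a rough illustration of the first and third TODO items, here is a minimal sketch of a multi-layer neighbor structure. It is not Lucene's HnswGraph: the class `LayeredGraphSketch`, its methods, and the `maxConn` parameter are all hypothetical. The only borrowed piece is the level rule floor(-ln(u) * mL) with mL = 1 / ln(M), which comes from the published HNSW algorithm.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical multi-layer neighbor structure; not Lucene's HnswGraph API. */
final class LayeredGraphSketch {
  // levels.get(l) maps each node present on level l to its neighbor list.
  private final List<Map<Integer, List<Integer>>> levels = new ArrayList<>();
  private final double levelNormalizer; // mL = 1 / ln(M) from the HNSW paper
  private final Random random;
  private int entryPoint = -1; // a node on the top level, where searches start
  private int size = 0;

  LayeredGraphSketch(int maxConn, long seed) {
    if (maxConn < 2) throw new IllegalArgumentException("maxConn must be >= 2");
    this.levelNormalizer = 1.0 / Math.log(maxConn);
    this.random = new Random(seed);
  }

  /** Level drawn as floor(-ln(u) * mL) for u in (0, 1]; higher levels are exponentially rarer. */
  static int levelFor(double u, double levelNormalizer) {
    return (int) Math.floor(-Math.log(u) * levelNormalizer);
  }

  /** Adds a node at a randomly drawn top level and returns its id. */
  int addNode() {
    // 1 - nextDouble() lies in (0, 1], which avoids ln(0).
    return addNode(levelFor(1.0 - random.nextDouble(), levelNormalizer));
  }

  /** Adds a node that is present on levels 0..topLevel and returns its id. */
  int addNode(int topLevel) {
    int node = size++;
    if (topLevel >= levels.size()) {
      entryPoint = node; // the node opens a new top level
    }
    while (levels.size() <= topLevel) {
      levels.add(new HashMap<>());
    }
    for (int l = 0; l <= topLevel; l++) {
      levels.get(l).put(node, new ArrayList<>());
    }
    return node;
  }

  /** Connects two nodes on one level; both must already exist on that level. */
  void link(int level, int a, int b) {
    levels.get(level).get(a).add(b);
    levels.get(level).get(b).add(a);
  }

  int numLevels() { return levels.size(); }
  int entryPoint() { return entryPoint; }
  int nodeCount(int level) { return levels.get(level).size(); }
  List<Integer> neighbors(int level, int node) { return levels.get(level).get(node); }
}
```

A search over such a structure would start at `entryPoint()` on the top level and descend greedily, level by level, to level 0. That part, and the on-disk format, are left out here.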



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10335) IOUtils.getDecodingReader(Class, String) is broken with modules

2022-05-04 Thread Mike Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531926#comment-17531926
 ] 

Mike Sokolov commented on LUCENE-10335:
---

sure - I opened https://issues.apache.org/jira/browse/LUCENE-10558



> IOUtils.getDecodingReader(Class, String) is broken with modules
> --
>
> Key: LUCENE-10335
> URL: https://issues.apache.org/jira/browse/LUCENE-10335
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Dawid Weiss
> Assignee: Uwe Schindler
> Priority: Major
> Fix For: 9.1, 10.0 (main)
>
> Attachments: LUCENE-10335-1.patch, LUCENE-10335.patch,
> Screenshot from 2021-12-25 18-04-55.png
>
> Time Spent: 18h 40m
> Remaining Estimate: 0h
>
> This method calls clazz.getResourceAsStream() but in a modular application 
> the method won't see any of the resources in clazz's module, causing an NPE. 
> We should deprecate or even remove this method entirely, leaving only 
> getDecodingReader(InputStream) and opening the resource on the caller's side.
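
A minimal sketch of what "opening the resource on the caller's side" could look like. The class `StopwordsCaller` and the file name `stopwords.txt` are hypothetical, and `decodingReader` is a local stand-in written for this example, not Lucene's IOUtils method. The point is that the class shipping the resource makes the `getResourceAsStream` call itself, checks for `null`, and only then hands the stream to a decoding helper.

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical caller that owns its resource; class and file names are illustrative only. */
final class StopwordsCaller {
  /** Strict decoder: malformed or unmappable bytes raise an exception instead of being replaced. */
  static Reader decodingReader(InputStream stream, Charset charset) {
    CharsetDecoder decoder = charset.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    return new BufferedReader(new InputStreamReader(stream, decoder));
  }

  /** The lookup is made by the class that ships the resource, so it stays inside that module. */
  static Reader openStopwords() throws IOException {
    InputStream stream = StopwordsCaller.class.getResourceAsStream("stopwords.txt");
    if (stream == null) {
      // getResourceAsStream returns null when nothing is found; fail clearly instead of with an NPE.
      throw new FileNotFoundException(
          "stopwords.txt not found next to " + StopwordsCaller.class.getName());
    }
    return decodingReader(stream, StandardCharsets.UTF_8);
  }

  /** Reads everything from the reader and closes it. */
  static String readAll(Reader reader) throws IOException {
    try (Reader r = reader) {
      StringBuilder sb = new StringBuilder();
      char[] buf = new char[1024];
      for (int n = r.read(buf); n != -1; n = r.read(buf)) {
        sb.append(buf, 0, n);
      }
      return sb.toString();
    }
  }
}
```

Because the utility only ever sees an `InputStream`, it never needs access to another module's resources.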






[jira] [Resolved] (LUCENE-8981) Update javadocs to reflect experimental status of Kuromoji DictionaryBuilder

2019-09-17 Thread Mike Sokolov (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov resolved LUCENE-8981.
--
Resolution: Fixed

> Update javadocs to reflect experimental status of Kuromoji DictionaryBuilder
> 
>
> Key: LUCENE-8981
> URL: https://issues.apache.org/jira/browse/LUCENE-8981
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Mike Sokolov
> Priority: Minor
> Fix For: 8.3
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> This is a follow-up to LUCENE-8971






[jira] [Updated] (LUCENE-8981) Update javadocs to reflect experimental status of Kuromoji DictionaryBuilder

2019-09-17 Thread Mike Sokolov (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated LUCENE-8981:
-
Fix Version/s: 8.3

> Update javadocs to reflect experimental status of Kuromoji DictionaryBuilder
> 
>
> Key: LUCENE-8981
> URL: https://issues.apache.org/jira/browse/LUCENE-8981
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Mike Sokolov
> Priority: Minor
> Fix For: 8.3
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> This is a follow-up to LUCENE-8971






[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-17 Thread Mike Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931367#comment-16931367
 ] 

Mike Sokolov edited comment on LUCENE-8920 at 9/17/19 12:06 PM:


This is cool. Regarding the strategy for which encoding to apply, I'll just 
call out the current heuristics:

{{doFixedArray = (node depth (distance from root node) <= 3 and N >= 5) or N >= 10}}

{{writeDirectly = doFixedArray && (max_label - min_label) < 4 * N}}
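
Written out as code, the two conditions above look roughly like this. The thresholds (depth 3, 5 arcs, 10 arcs, factor 4) are the ones quoted in this comment; the class and method names are hypothetical and are not the FST builder's actual API.

```java
/** Hypothetical restatement of the heuristics quoted above; not Lucene's FST code. */
final class ArcEncodingHeuristic {
  /** Fixed-size array encoding: shallow nodes with at least 5 arcs, or any node with at least 10. */
  static boolean doFixedArray(int depth, int numArcs) {
    return (depth <= 3 && numArcs >= 5) || numArcs >= 10;
  }

  /** Direct addressing only when the label span is less than 4x the number of arcs. */
  static boolean writeDirectly(int depth, int numArcs, int minLabel, int maxLabel) {
    return doFixedArray(depth, numArcs) && (maxLabel - minLabel) < 4 * numArcs;
  }
}
```

For example, a node with 10 arcs whose labels span 0..39 qualifies for direct addressing, while the same node with labels spanning 0..40 does not.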

 

I think we can still apply list-encoding for small N, and treat open
addressing as a variant within "doFixedArray," where we can now choose among
direct addressing (for the lowest load factors L), open addressing (for the
intermediate case), and binary search (for the highest L). Does that sound
right?

 

I wonder if we could work backwards from a single parameter L: the maximum
memory cost (relative to list encoding). The API would guarantee that no set of
arcs is ever encoded using more than L * the minimum possible size, and
internally we would choose the best (i.e. fastest-lookup) encoding that achieves
that, possibly with some tweak for higher-order arcs (i.e. those near the root).
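
To make the single-parameter idea concrete, here is a minimal sketch under a deliberately simple cost model. Everything in it is an assumption made for illustration: the size estimates (fixed slots of `maxArcBytes` each, an open-addressing table with twice as many slots as arcs), the speed ordering, and all names. It only shows the shape of "pick the fastest encoding that stays within L times the list-encoded size".

```java
/** Hypothetical budget-driven choice of arc encoding; the cost model is an assumption. */
final class MemoryBudgetedEncoding {
  /** Candidates, ordered from the fastest assumed lookup to the slowest. */
  enum Encoding { DIRECT_ADDRESSING, OPEN_ADDRESSING, BINARY_SEARCH, LIST }

  /**
   * @param numArcs       number of arcs leaving the node
   * @param labelRange    maxLabel - minLabel + 1
   * @param maxArcBytes   size of the largest arc, used as the fixed slot size
   * @param listBytes     size of the node under list encoding (treated as the minimum)
   * @param maxCostFactor the parameter L: allowed size relative to list encoding
   */
  static Encoding choose(
      int numArcs, int labelRange, int maxArcBytes, long listBytes, double maxCostFactor) {
    double budget = maxCostFactor * listBytes;
    long directBytes = (long) labelRange * maxArcBytes; // one slot per possible label
    long openBytes = 2L * numArcs * maxArcBytes;        // assumed load factor of 0.5
    long binaryBytes = (long) numArcs * maxArcBytes;    // one padded slot per arc
    if (directBytes <= budget) return Encoding.DIRECT_ADDRESSING;
    if (openBytes <= budget) return Encoding.OPEN_ADDRESSING;
    if (binaryBytes <= budget) return Encoding.BINARY_SEARCH;
    return Encoding.LIST; // always fits: it defines the baseline
  }
}
```

The "tweak for higher-order arcs" could then be expressed as a larger `maxCostFactor` for nodes near the root, though that is not shown here.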


was (Author: sokolov):
This is cool. Regarding the strategy for which encoding to apply, I'll just 
call out the current heuristics:

{{doFixedArray = (node depth (distance from root node) <= 3 and N >= 5) or N >= 10}}

{{writeDirectly = doFixedArray && (max_label - min_label) < 4 * N}}

 

I think we can still apply list-encoding for small N, and treat open
addressing as a variant within "doFixedArray," where we can now choose among
direct addressing (for the lowest load factors L), open addressing (for the
intermediate case), and binary search (for the highest L). Does that sound
right?

 

I wonder if we could work backwards from a single parameter L: the maximum
memory cost (relative to list encoding). The API would guarantee that no set of
arcs is ever encoded using more than L * the minimum possible size, and
internally we would choose the best (i.e. fastest-lookup) encoding that achieves
that, possibly with some tweak for higher-order arcs (i.e. those near the root).

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Mike Sokolov
> Priority: Blocker
> Fix For: 8.3
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance,
> the size increase we're seeing while building (or perhaps do a preliminary
> pass before building) in order to decide whether to apply the encoding.
> bq. we could also make the encoding a bit more efficient. For instance I
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte
> range), which makes gaps very costly. Associating each label with a dense id
> and having an intermediate lookup, i.e. lookup label -> id and then id -> arc
> offset instead of doing label -> arc directly could save a lot of space in some cases?
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?
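
The label -> id -> arc offset idea from the second quoted suggestion can be sketched as follows. This is an illustration only, with hypothetical names and in-memory arrays rather than the FST's byte layout. It assumes a non-empty, strictly increasing label array and fewer than 32768 arcs per node. A gap in the label range then costs one 2-byte slot instead of a padded 10-20 byte arc slot.

```java
import java.util.Arrays;

/** Hypothetical two-step lookup: label -> dense id -> arc offset. Not Lucene's FST layout. */
final class DenseIdArcTable {
  private final int minLabel;
  private final short[] idByLabel;   // one 2-byte slot per label in [minLabel, maxLabel]; -1 marks a gap
  private final int[] arcOffsetById; // one entry per arc that actually exists

  /** sortedLabels must be non-empty and strictly increasing; arcOffsets is parallel to it. */
  DenseIdArcTable(int[] sortedLabels, int[] arcOffsets) {
    minLabel = sortedLabels[0];
    int range = sortedLabels[sortedLabels.length - 1] - minLabel + 1;
    idByLabel = new short[range];
    Arrays.fill(idByLabel, (short) -1);
    for (int id = 0; id < sortedLabels.length; id++) {
      idByLabel[sortedLabels[id] - minLabel] = (short) id;
    }
    arcOffsetById = arcOffsets.clone();
  }

  /** Returns the arc offset for the label, or -1 if the node has no arc with that label. */
  int arcOffset(int label) {
    int slot = label - minLabel;
    if (slot < 0 || slot >= idByLabel.length) return -1;
    int id = idByLabel[slot];
    return id < 0 ? -1 : arcOffsetById[id];
  }

  /** Bytes used by the two lookup tables; the arc metadata itself is stored once per arc elsewhere. */
  long tableBytes() {
    return 2L * idByLabel.length + 4L * arcOffsetById.length;
  }
}
```

With labels {97, 100, 122} the label table has 26 two-byte slots and the offset table has 3 entries, whereas direct label -> arc addressing would reserve 26 full-size arc slots.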






[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-17 Thread Mike Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931367#comment-16931367
 ] 

Mike Sokolov commented on LUCENE-8920:
--

This is cool. Regarding the strategy for which encoding to apply, I'll just 
call out the current heuristics:

{{doFixedArray = (node depth (distance from root node) <= 3 and N >= 5) or N >= 10}}

{{writeDirectly = doFixedArray && (max_label - min_label) < 4 * N}}

 

I think we can still apply list-encoding for small N, and treat open
addressing as a variant within "doFixedArray," where we can now choose among
direct addressing (for the lowest load factors L), open addressing (for the
intermediate case), and binary search (for the highest L). Does that sound
right?

 

I wonder if we could work backwards from a single parameter L: the maximum
memory cost (relative to list encoding). The API would guarantee that no set of
arcs is ever encoded using more than L * the minimum possible size, and
internally we would choose the best (i.e. fastest-lookup) encoding that achieves
that, possibly with some tweak for higher-order arcs (i.e. those near the root).

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Mike Sokolov
> Priority: Blocker
> Fix For: 8.3
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance,
> the size increase we're seeing while building (or perhaps do a preliminary
> pass before building) in order to decide whether to apply the encoding.
> bq. we could also make the encoding a bit more efficient. For instance I
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte
> range), which makes gaps very costly. Associating each label with a dense id
> and having an intermediate lookup, i.e. lookup label -> id and then id -> arc
> offset instead of doing label -> arc directly could save a lot of space in some cases?
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?






[jira] [Created] (LUCENE-8981) Update javadocs to reflect experimental status of Kuromoji DictionaryBuilder

2019-09-16 Thread Mike Sokolov (Jira)
Mike Sokolov created LUCENE-8981:


Summary: Update javadocs to reflect experimental status of Kuromoji DictionaryBuilder
Key: LUCENE-8981
URL: https://issues.apache.org/jira/browse/LUCENE-8981
Project: Lucene - Core
Issue Type: Improvement
Components: modules/analysis
Reporter: Mike Sokolov


This is a follow-up to LUCENE-8971


