[jira] [Commented] (LUCENE-8945) Allow to change the output file delimiter on Luke "export terms" feature
[ https://issues.apache.org/jira/browse/LUCENE-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932147#comment-16932147 ] Tomoko Uchida commented on LUCENE-8945: --- +1, I will commit it to the ASF repo shortly.

> Allow to change the output file delimiter on Luke "export terms" feature
>
> Key: LUCENE-8945
> URL: https://issues.apache.org/jira/browse/LUCENE-8945
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/luke
> Reporter: Tomoko Uchida
> Priority: Minor
> Attachments: LUCENE-8945.patch, LUCENE-8945.patch, delimiter_comma_exported_file.PNG, delimiter_space_exported_file.PNG, delimiter_tab_exported_file.PNG, luke_export_delimiter.png
>
> This is a follow-up issue for LUCENE-8764.
> The current delimiter is fixed to "," (comma), but terms can also include commas and they are not escaped. It would be better if the delimiter could be changed/selected to a tab or whitespace when exporting.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8945) Allow to change the output file delimiter on Luke "export terms" feature
[ https://issues.apache.org/jira/browse/LUCENE-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-8945: -- Attachment: LUCENE-8945-final.patch
[jira] [Commented] (LUCENE-8945) Allow to change the output file delimiter on Luke "export terms" feature
[ https://issues.apache.org/jira/browse/LUCENE-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932320#comment-16932320 ] Tomoko Uchida commented on LUCENE-8945: --- It seems the ASF bot is not working... Committed to master and 8.x, with a slight modification (moved Delimiter to a private enum; it is used only in the factory anyway). Here is the final patch: [^LUCENE-8945-final.patch] [https://github.com/apache/lucene-solr/commit/369df12c2cc54e929bd25dd77424242ddd0fb047] Thanks [~shahamish150294]!
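The committed change makes the delimiter selectable instead of hard-coded to a comma. As an illustration only (the class and method names below are hypothetical, not the actual Luke code), a private Delimiter enum used inside the export factory might look like this:

```java
// Hypothetical sketch of the idea described in the comment above: a private
// enum holding the selectable separators, visible only to the factory class.
public final class ExportTermsFactorySketch {

  // Private enum, as in the committed change; it is used only in this factory.
  private enum Delimiter {
    COMMA(","), TAB("\t"), WHITESPACE(" ");

    private final String separator;

    Delimiter(String separator) {
      this.separator = separator;
    }

    String separator() {
      return separator;
    }
  }

  /** Formats one output line as term + delimiter + docFreq. */
  public static String formatLine(String term, int docFreq, String delimiterName) {
    Delimiter d = Delimiter.valueOf(delimiterName);
    return term + d.separator() + docFreq;
  }
}
```

Note how a term containing a comma stays unambiguous once a tab or space delimiter is selected, which is the motivation stated in the issue description.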
[jira] [Resolved] (LUCENE-8945) Allow to change the output file delimiter on Luke "export terms" feature
[ https://issues.apache.org/jira/browse/LUCENE-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida resolved LUCENE-8945. --- Fix Version/s: master (9.0), 8.3 Assignee: Tomoko Uchida Resolution: Fixed
[jira] [Commented] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976542#comment-16976542 ] Tomoko Uchida commented on LUCENE-9004: --- Thanks for mentioning it. I have been working on this issue for a couple of weeks, and here is my WIP/PoC branch (it's not a PR yet, because the "Query" part is still missing): [https://github.com/mocobeta/lucene-solr-mirror/commits/jira/LUCENE-9004-aknn] I borrowed [~sokolov]'s idea but took a different implementation approach:
- Introduce a new codec (Format, Writer, and Reader) for the graph part. The new {{GraphFormat}} can express a multi-level (document) graph.
- Introduce a new doc values field type for the vector part. The new {{VectorDocValues}} shares the same codec as BinaryDocValues but provides special functionality for dense vector handling: it encodes/decodes a float array to/from a binary value, keeps the number of dimensions and the distance function, and allows random access to the underlying binary doc values. (For now I just reset IndexedDISI when seeking backwards.)

It works, but there are indexing performance concerns (due to costly graph construction). Anyway, I hope I can create a PR with working examples before long...

> Approximate nearest vector search
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Michael Sokolov
> Priority: Major
> Attachments: hnsw_layered_graph.png
>
> "Semantic" search based on machine-learned vector "embeddings" representing terms, queries and documents is becoming a must-have feature for a modern search engine. SOLR-12890 is exploring various approaches to this, including providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. 
Researchers > have found an approach based on navigating a graph that partially encodes the > nearest neighbor relation at multiple scales can provide accuracy > 95% (as > compared to exact nearest neighbor calculations) at a reasonable cost. This > issue will explore implementing HNSW (hierarchical navigable small-world) > graphs for the purpose of approximate nearest vector search (often referred > to as KNN or k-nearest-neighbor search). > At a high level the way this algorithm works is this. First assume you have a > graph that has a partial encoding of the nearest neighbor relation, with some > short and some long-distance links. If this graph is built in the right way > (has the hierarchical navigable small world property), then you can > efficiently traverse it to find nearest neighbors (approximately) in log N > time where N is the number of nodes in the graph. I believe this idea was > pioneered in [1]. The great insight in that paper is that if you use the > graph search algorithm to find the K nearest neighbors of a new document > while indexing, and then link those neighbors (undirectedly, ie both ways) to > the new document, then the graph that emerges will have the desired > properties. > The implementation I propose for Lucene is as follows. We need two new data > structures to encode the vectors and the graph. We can encode vectors using a > light wrapper around {{BinaryDocValues}} (we also want to encode the vector > dimension and have efficient conversion from bytes to floats). For the graph > we can use {{SortedNumericDocValues}} where the values we encode are the > docids of the related documents. Encoding the interdocument relations using > docids directly will make it relatively fast to traverse the graph since we > won't need to lookup through an id-field indirection. This choice limits us > to building a graph-per-segment since it would be impractical to maintain a > global graph for the whole index in the face of segment merges. 
However, graph-per-segment is very natural at search time - we can traverse each segment's graph independently and merge results as we do today for term-based search.
> At index time, however, merging graphs is somewhat challenging. While indexing we build a graph incrementally, performing searches to construct links among neighbors. When merging segments we must construct a new graph containing elements of all the merged segments. Ideally we would somehow preserve the work done when building the initial graphs, but at least as a start I'd propose we construct a new graph from scratch when merging. The process is going to be limited, at least initially, to graphs that can fit in RAM since we require random access to the entire graph while constructing it: in order to add links bidirectionally we must continually update existing documents.
> I think we want to express this API to users as a single joint {{KnnGraphField}} abstraction that joins together the vectors and the graph as a single joint field type. Mostly it just looks like a vector-valued field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many nocommits, basic design is not really set, there is no Query implementation and no integration with IndexSearcher, but it does work by some measure using a standalone test class. I've tested with uniform random vectors and on my laptop indexed 10K documents in around 10 seconds and searched them at 95% recall (compared with exact nearest-neighbor baseline) at around 250 QPS. I haven't made any attempt to use multithreaded search for this, but it is amenable to per-segment concurrency.
> [1] https://www.semanticscholar.org/paper/Efficient-and-robust-approximate-nearest-neighbor-Malkov-Yashunin/699a2e3b653c69aff5cf7a9923793b974f8ca164
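The {{VectorDocValues}} idea mentioned in the comment - a dense float vector stored as a binary doc value - boils down to a fixed-length float-to-byte packing. A minimal, hypothetical sketch of that encode/decode step (not the actual code from the PoC branch):

```java
import java.nio.ByteBuffer;

// Illustrative sketch: pack a dense float vector into the binary payload of
// a BinaryDocValues-like field, and unpack it again. 4 bytes per dimension,
// so the value length is fixed once the number of dimensions is known.
public final class VectorEncodingSketch {

  /** Packs a float array into a fixed-length byte array (Float.BYTES per dim). */
  public static byte[] encode(float[] vector) {
    ByteBuffer buffer = ByteBuffer.allocate(vector.length * Float.BYTES);
    for (float v : vector) {
      buffer.putFloat(v);
    }
    return buffer.array();
  }

  /** Unpacks the binary value back into a float array of numDims entries. */
  public static float[] decode(byte[] bytes, int numDims) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    float[] vector = new float[numDims];
    for (int i = 0; i < numDims; i++) {
      vector[i] = buffer.getFloat();
    }
    return vector;
  }
}
```

Because every value has the same length, a reader that knows the number of dimensions can locate any document's vector without a per-document length prefix.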
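The traversal the issue description outlines - navigate the graph toward the query's nearest neighbors in roughly log N steps - can be illustrated with a greedy descent: start at an entry node and repeatedly hop to whichever neighbor is closer to the query. Real HNSW runs this per layer and keeps a beam of candidates; this single-layer, beam-of-one sketch (all names are illustrative, not from the POC) only shows the core idea:

```java
// Greedy nearest-neighbor walk over an adjacency-list graph. This is a
// deliberately simplified, single-layer version of the HNSW search step.
public final class GreedySearchSketch {

  static double squaredDistance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /**
   * @param neighbors neighbors[doc] lists the docs linked to doc
   * @param vectors   vectors[doc] is that doc's vector
   * @return the doc where no neighbor is closer to the query (a local optimum)
   */
  public static int search(int[][] neighbors, float[][] vectors, float[] query, int entryPoint) {
    int current = entryPoint;
    double currentDist = squaredDistance(vectors[current], query);
    boolean improved = true;
    while (improved) {
      improved = false;
      for (int friend : neighbors[current]) {
        double d = squaredDistance(vectors[friend], query);
        if (d < currentDist) { // move to the closer neighbor
          current = friend;
          currentDist = d;
          improved = true;
        }
      }
    }
    return current;
  }
}
```

The long-distance links mentioned in the description are what keep this walk short: upper layers let the search skip across the dataset before refining in the dense bottom layer.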
[jira] [Commented] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980772#comment-16980772 ] Tomoko Uchida commented on LUCENE-9004: --- Just a status update: [my PoC branch|https://github.com/mocobeta/lucene-solr-mirror/tree/jira/LUCENE-9004-aknn] is still at a pretty early stage and works only on one segment, but it can now index and query arbitrary vectors via [this example code|https://gist.github.com/mocobeta/5c174ee9fc6408470057a9e7d2020c45]. The newly added KnnGraphQuery extends the Query class, so it should be combinable with other queries, with some limitations, because a knn query cannot score the entire dataset by nature. Indexing performance is terrible for now (it takes a few minutes for hundreds of thousands of vectors with 100 dims on a commodity PC), but searching doesn't look too bad (~30 msec for the same dataset) thanks to the skip-list-like graph structure.

On my current branch I wrapped {{BinaryDocValues}} to store vector values. However, exposing random access capability for doc values (or their extensions) can be controversial, so I'd like to propose a new codec which combines 1. the HNSW graph and 2. the vectors (float arrays). The new format for each vector field would have three parts (in other words, three files in a segment). They would look like:

{code:java}
Meta data and index part:
+-----------+
| meta data |
+--------+------------------------------------------+
| doc id | offset to first friend list for the doc  |
+--------+------------------------------------------+
| doc id | offset to first friend list for the doc  |
+--------+------------------------------------------+
| ...                                               |
+---------------------------------------------------+

Graph data part:
+-------------------------+---------------------------+-----+-------------------------+
| friends list at layer N | friends list at layer N-1 | ... | friends list at layer 0 |  <- friends lists for doc 0
+-------------------------+---------------------------+-----+-------------------------+
| friends list at layer N | friends list at layer N-1 | ... | friends list at layer 0 |  <- friends lists for doc 1
+-------------------------+---------------------------+-----+-------------------------+
| ...                                                                                 |  <- and so on
+-------------------------------------------------------------------------------------+

Vector data part:
+----------------------+
| encoded vector value |  <- vector value for doc 0
+----------------------+
| encoded vector value |  <- vector value for doc 1
+----------------------+
| ...                  |  <- and so on
+----------------------+
{code}

- "meta data" includes: the number of dimensions, the distance function for similarity calculation, and other field-level meta data
- "doc id": the doc ids having a vector value for this field
- "friends list at layer N": a delta-encoded list of the target doc ids connected to the doc at the Nth layer
- "encoded vector value": a fixed-length byte array; the offset of the value can be calculated on the fly (limitation: each document can have only one vector value per vector field)

The graph data (friends lists) is relatively small, so we could keep all of it on the Java heap for fast retrieval (though some off-heap strategy might be required for very large graphs). The vector data (vector values) is large and only a small fraction of it is needed when searching, so it should be accessed on demand via the index.

Feedback is welcome. And I have a question about introducing new formats - is there a way to inject an XXXFormat into the indexing chain so that we can add this feature without any changes to {{lucene-core}}?
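Two encoding details from the proposed layout lend themselves to a quick sketch (method names here are illustrative, not from the branch): the on-the-fly offset computation that fixed-length vector values permit, and the delta encoding of a sorted friends list.

```java
// Sketch of two details of the proposed on-disk format: computing a vector's
// offset in the vector data part from its ordinal, and delta-encoding a
// sorted doc-id list so each entry stores only the gap to its predecessor.
public final class FormatSketch {

  /** Offset of the ordinal-th vector when each value is numDims floats. */
  public static long vectorOffset(int ordinal, int numDims) {
    return (long) ordinal * numDims * Float.BYTES;
  }

  /** Delta-encodes a sorted friends (doc id) list; gaps are small, so they compress well. */
  public static int[] deltaEncode(int[] sortedDocIds) {
    int[] deltas = new int[sortedDocIds.length];
    int previous = 0;
    for (int i = 0; i < sortedDocIds.length; i++) {
      deltas[i] = sortedDocIds[i] - previous;
      previous = sortedDocIds[i];
    }
    return deltas;
  }
}
```

The fixed-length constraint is what makes the offset arithmetic possible, which in turn is why the format limits each document to one vector value per field.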
[jira] [Updated] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-9004: -- Attachment: hnsw_layered_graph.png Status: Open (was: Open)
[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955354#comment-16955354 ] Tomoko Uchida edited comment on LUCENE-9004 at 10/20/19 2:13 AM: --- Hi, I've been trying to understand the Hierarchical NSW paper, its predecessor the NSW model ([https://publications.hse.ru/mirror/pubs/share/folder/x5p6h7thif/direct/128296059]), and the PoC implementation. Just to clarify the discussion (and my understanding) here, I'd like to leave some comments about the data structures/encodings.
- We need two data structures, as Michael Sokolov described: 1) the vectors (the value tied to each vertex) and 2) the layered undirected graph (I think the figure below helps our understanding).
- 1) The vectors (float arrays) can simply be represented by {{BinaryDocValues}}, as in the current PoC. For more efficient data access on large indexes, it may be necessary to introduce new Formats.
- Indeed we need random access - it's not related to 1), the vector (value) representation itself, but is required for 2), to traverse/search the "undirected" graph.
- Additionally, we need a skip-list-like structure to encode the hierarchical graph (still not implemented).

!hnsw_layered_graph.png!

My feeling is that at least we need a new Format to represent 2), the layered undirected graph. The current PoC branch encodes layer 0 with {{SortedNumericDocValues}}; however, since we eventually intend to introduce multiple layers, it wouldn't be possible to represent those with existing doc values. (Please correct me if this is not true :) ) Or would it be possible to introduce a new, dedicated auxiliary data structure/algorithm for HNSW apart from postings lists, like FST? I mean, for layered undirected graph construction/traversal, we could have an o.a.l.util.hnsw package. It's just an idea and I'm now attempting to delve into that... [~sokolov], have you considered this so far?
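The layered undirected graph discussed above can be sketched as one adjacency map per layer, with every link added in both directions. This is a hedged illustration of the structure in the hnsw_layered_graph.png figure (the class is hypothetical, not code from the PoC branch):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model of a layered, undirected graph: layers.get(n) maps a
// node id to its friends list at layer n. Links are stored in both
// directions, matching the undirected links the HNSW construction produces.
public final class LayeredGraphSketch {
  private final List<Map<Integer, List<Integer>>> layers = new ArrayList<>();

  public void addLink(int layer, int from, int to) {
    while (layers.size() <= layer) {
      layers.add(new HashMap<>());
    }
    Map<Integer, List<Integer>> adjacency = layers.get(layer);
    adjacency.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    adjacency.computeIfAbsent(to, k -> new ArrayList<>()).add(from); // undirected
  }

  public List<Integer> neighbors(int layer, int node) {
    return layers.get(layer).getOrDefault(node, List.of());
  }

  public int numLayers() {
    return layers.size();
  }
}
```

Only layer 0 contains every node (the part the PoC encodes with {{SortedNumericDocValues}}); upper layers hold progressively fewer nodes, which is the skip-list-like hierarchy the comment refers to.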
[jira] [Commented] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955396#comment-16955396 ] Tomoko Uchida commented on LUCENE-9004: --- {quote}Indeed we need random access - it's not related to 1) the vector (the value) representation itself but required for 2), to traverse/search the "undirectional" graph. {quote} Ah, sorry, that was not correct: we would need random access for both the vectors and the graph... Please ignore that line. > Approximate nearest vector search > - > > Key: LUCENE-9004 > URL: https://issues.apache.org/jira/browse/LUCENE-9004 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Michael Sokolov >Priority: Major > Attachments: hnsw_layered_graph.png > > > "Semantic" search based on machine-learned vector "embeddings" representing > terms, queries and documents is becoming a must-have feature for a modern > search engine. SOLR-12890 is exploring various approaches to this, including > providing vector-based scoring functions. This is a spinoff issue from that. > The idea here is to explore approximate nearest-neighbor search. Researchers > have found an approach based on navigating a graph that partially encodes the > nearest neighbor relation at multiple scales can provide accuracy > 95% (as > compared to exact nearest neighbor calculations) at a reasonable cost. This > issue will explore implementing HNSW (hierarchical navigable small-world) > graphs for the purpose of approximate nearest vector search (often referred > to as KNN or k-nearest-neighbor search). > At a high level the way this algorithm works is this. First assume you have a > graph that has a partial encoding of the nearest neighbor relation, with some > short and some long-distance links. 
If this graph is built in the right way > (has the hierarchical navigable small world property), then you can > efficiently traverse it to find nearest neighbors (approximately) in log N > time where N is the number of nodes in the graph. I believe this idea was > pioneered in [1]. The great insight in that paper is that if you use the > graph search algorithm to find the K nearest neighbors of a new document > while indexing, and then link those neighbors (undirectedly, i.e. both ways) to > the new document, then the graph that emerges will have the desired > properties. > The implementation I propose for Lucene is as follows. We need two new data > structures to encode the vectors and the graph. We can encode vectors using a > light wrapper around {{BinaryDocValues}} (we also want to encode the vector > dimension and have efficient conversion from bytes to floats). For the graph > we can use {{SortedNumericDocValues}} where the values we encode are the > docids of the related documents. Encoding the interdocument relations using > docids directly will make it relatively fast to traverse the graph since we > won't need to look up through an id-field indirection. This choice limits us > to building a graph-per-segment since it would be impractical to maintain a > global graph for the whole index in the face of segment merges. However > graph-per-segment is very natural at search time - we can traverse each > segment's graph independently and merge results as we do today for term-based > search. > At index time, however, merging graphs is somewhat challenging. While > indexing we build a graph incrementally, performing searches to construct > links among neighbors. When merging segments we must construct a new graph > containing elements of all the merged segments. Ideally we would somehow > preserve the work done when building the initial graphs, but at least as a > start I'd propose we construct a new graph from scratch when merging. 
The > process is going to be limited, at least initially, to graphs that can fit > in RAM since we require random access to the entire graph while constructing > it: In order to add links bidirectionally we must continually update existing > documents. > I think we want to express this API to users as a single joint > {{KnnGraphField}} abstraction that joins together the vectors and the graph > as a single joint field type. Mostly it just looks like a vector-valued > field, but has this graph attached to it. > I'll push a branch with my POC and would love to hear comments. It has many > nocommits, basic design is not really set, there is no Query implementation > and no integration with IndexSearcher, but it does work by some measure using > a standalone test class. I've tested with uniform random vectors and on my > laptop indexed 10K documents in around 10 seconds and searched them at 95% > recall (compared with exact nearest-neighbor baseline) at around 250 QPS. I > haven't made any attempt to use multithreaded
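The greedy descent at the heart of the graph search described above can be illustrated in a few lines of plain Java. This is a toy sketch, not the PoC code from the branch: the class and method names ({{GreedyNN}}, {{greedySearch}}) are made up, and plain in-memory arrays stand in for the proposed {{BinaryDocValues}} / {{SortedNumericDocValues}} encodings.

```java
/**
 * Toy illustration of greedy nearest-neighbor descent over a proximity
 * graph. Hypothetical names; arrays stand in for the doc-values encodings.
 */
public class GreedyNN {

  /** Squared Euclidean distance; only the ordering matters for search. */
  static float dist(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /**
   * Follow neighbor links "downhill" until no neighbor of the current node
   * is closer to the query. Note this needs random access to both the
   * vectors and the adjacency lists, as discussed in the comments.
   */
  static int greedySearch(float[][] vectors, int[][] neighbors, float[] query, int entry) {
    int cur = entry;
    float curDist = dist(vectors[cur], query);
    boolean improved = true;
    while (improved) {
      improved = false;
      for (int nb : neighbors[cur]) {
        float d = dist(vectors[nb], query);
        if (d < curDist) {
          cur = nb;
          curDist = d;
          improved = true;
        }
      }
    }
    return cur;
  }

  public static void main(String[] args) {
    // Five 2-d points on a line; each node links to its immediate neighbors.
    float[][] vectors = {{0, 0}, {1, 0}, {2, 0}, {3, 0}, {4, 0}};
    int[][] neighbors = {{1}, {0, 2}, {1, 3}, {2, 4}, {3}};
    // Node 3 at (3, 0) is the closest point to the query (3.2, 0).
    System.out.println(greedySearch(vectors, neighbors, new float[] {3.2f, 0}, 0)); // prints 3
  }
}
```

This shows only the hill-climbing core on a single layer; the full HNSW algorithm additionally keeps a beam of candidates and descends through the sparser upper layers before refining at the bottom.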
[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955354#comment-16955354 ] Tomoko Uchida edited comment on LUCENE-9004 at 10/20/19 5:29 AM: - Hi, I've been trying to understand the Hierarchical NSW paper, its previous work the NSW model ([https://publications.hse.ru/mirror/pubs/share/folder/x5p6h7thif/direct/128296059]), and the PoC implementation. Just to clarify the discussion (and my understanding) here, I'd like to leave my comments about data structures/encodings. - We need two data structures as Michael Sokolov described: 1) the vector (the value tied to a vertex) and 2) the layered undirected graph (I think the figure below helps our understanding). - 1) the vector (float array) can simply be represented by {{BinaryDocValues}} as in the current PoC. For more efficient data access for large indexes, it may be required to introduce new Formats. - -Indeed we need random access - it's not related to 1) the vector (the value) representation itself but required for 2), to traverse/search the "undirectional" graph.- - Additionally, we need a skip-list-like structure to encode the hierarchical graph (still not implemented). !hnsw_layered_graph.png! My feeling is that at least we need a new Format to represent 2), the layered undirected graph. The current PoC branch encodes Layer 0 with {{SortedNumericDocValues}}; however, since we eventually intend to introduce multiple layers, it wouldn't be possible to represent those with existing doc values. (Please correct me if this is not true :) ) Or would it be possible to introduce a new, dedicated auxiliary data structure / algorithm for HNSW apart from postings lists, like FST? I mean, for layered undirected graph construction/traversal, we could have an o.a.l.util.hnsw package. It's just an idea and I'm now attempting to delve into that... [~sokolov] have you considered it so far? 
was (Author: tomoko uchida): Hi, I've been trying to understand the Hierachical NSW paper, its previous work NSW model ([https://publications.hse.ru/mirror/pubs/share/folder/x5p6h7thif/direct/128296059]), and the PoC implementation. Just to clarify the discussion (and my understanding) here, I'd like to leave my comments about data structures/encodings. - We need two data structures as Michael Sokolov described, 1) the vector (value tied up to a vertex) and 2) layered undirectional graph (I think the figure below helps our understanding). - 1) the vector (float array) can simply be represented by {{BinaryDocValues}} as current PoC. For more efficient data access for large indexes, it may be required to introduce new Formats. - Indeed we need random access - it's not related to 1) the vector (the value) representation itself but required for 2), to traverse/search the "undirectional" graph. - Additionally, we need a skip-list like structure to encode the hierarchical graph (still not implemented). !hnsw_layered_graph.png! My feeling is that at least we need a new Format to represent 2), layered undirectional graph. Current PoC branch encodes the Layer 0 by {{SortedNumericDocValues}}, however, we will eventually intend to introduce multiple layers, it wouldn't be possible to represent those by existing doc values. (Please correct me if this is not true :) ) Or would it be possible that we introduce a new, dedicated auxiliary data structure / algorithm for HNSW apart from postings lists, like FST? I mean, for layered undirectional graph construction/traversal, we could have o.a.l.util.hnsw package. It's just an idea and I'm now attempting to delve into that... [~sokolov] have you considered it so far? 
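The "skip-list like" hierarchy discussed in the comment above can be sketched as a stack of adjacency maps. This is purely illustrative and NOT the PoC code: the class and its in-memory maps are hypothetical stand-ins for whatever encoding (existing doc values or a new dedicated Format, e.g. in an o.a.l.util.hnsw-style package) is eventually chosen. It shows why sparse upper layers with long-distance links let the search reach the right neighborhood quickly before refining on the dense Layer 0.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Toy sketch of layered ("skip-list like") graph traversal. All names and
 * structures here are hypothetical, for illustration only.
 */
public class LayeredGraphSketch {

  // layers.get(l).get(node) = neighbors of node on layer l; layer 0 is densest.
  private final List<Map<Integer, int[]>> layers;
  private final float[][] vectors;

  LayeredGraphSketch(List<Map<Integer, int[]>> layers, float[][] vectors) {
    this.layers = layers;
    this.vectors = vectors;
  }

  private float dist(int node, float[] query) {
    float s = 0;
    for (int i = 0; i < query.length; i++) {
      float d = vectors[node][i] - query[i];
      s += d * d;
    }
    return s;
  }

  /** Greedily refine the candidate on each layer, then drop one level down. */
  int search(float[] query, int entryPoint) {
    int cur = entryPoint;
    for (int l = layers.size() - 1; l >= 0; l--) {
      Map<Integer, int[]> layer = layers.get(l);
      boolean improved = true;
      while (improved) {
        improved = false;
        for (int nb : layer.getOrDefault(cur, new int[0])) {
          if (dist(nb, query) < dist(cur, query)) {
            cur = nb;
            improved = true;
          }
        }
      }
    }
    return cur;
  }

  /** Eight 1-d points; layer 1 holds a sparse subset with one long link. */
  static LayeredGraphSketch exampleGraph() {
    float[][] vectors = {{0}, {1}, {2}, {3}, {4}, {5}, {6}, {7}};
    Map<Integer, int[]> layer0 = new HashMap<>();
    for (int i = 0; i < 8; i++) {
      layer0.put(i, i == 0 ? new int[] {1} : i == 7 ? new int[] {6} : new int[] {i - 1, i + 1});
    }
    Map<Integer, int[]> layer1 = new HashMap<>();
    layer1.put(0, new int[] {4});
    layer1.put(4, new int[] {0});
    return new LayeredGraphSketch(Arrays.asList(layer0, layer1), vectors);
  }

  public static void main(String[] args) {
    // The layer-1 long link 0 -> 4 jumps most of the way, then layer 0 refines.
    System.out.println(exampleGraph().search(new float[] {6.2f}, 0)); // prints 6
  }
}
```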
[jira] [Resolved] (LUCENE-8998) OverviewImplTest.testIsOptimized reproducible failure
[ https://issues.apache.org/jira/browse/LUCENE-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida resolved LUCENE-8998. --- Fix Version/s: 8.3, master (9.0) Resolution: Fixed Thank you [~hossman] for reporting. > OverviewImplTest.testIsOptimized reproducible failure > - > > Key: LUCENE-8998 > URL: https://issues.apache.org/jira/browse/LUCENE-8998 > Project: Lucene - Core > Issue Type: Bug > Components: luke >Reporter: Chris M. Hostetter >Assignee: Tomoko Uchida >Priority: Major > Fix For: master (9.0), 8.3 > > Attachments: LUCENE-8998.patch > > > The following seed reproduces reliably for me on master... > (NOTE: the {{ERROR StatusLogger}} messages, including the one about the > AccessControlException, occur even with other seeds when the test passes) > {noformat} > [mkdir] Created dir: /home/hossman/lucene/alt_dev/lucene/build/luke/test > [junit4:pickseed] Seed property 'tests.seed' already defined: 9123DD19C50D658 > [mkdir] Created dir: > /home/hossman/lucene/alt_dev/lucene/build/luke/test/temp >[junit4] says cześć! Master seed: 9123DD19C50D658 >[junit4] Executing 1 suite with 1 JVM. >[junit4] >[junit4] Started J0 PID(8576@localhost). >[junit4] Suite: org.apache.lucene.luke.models.overview.OverviewImplTest >[junit4] 2> ERROR StatusLogger No Log4j 2 configuration file found. > Using default configuration (logging only errors to the console), or user > programmatically provided configurations. Set system property 'log4j2.debug' > to show Log4j 2 internal initialization logging. 
See > https://logging.apache.org/log4j/2.x/manual/configuration.html for > instructions on how to configure Log4j 2 >[junit4] 2> ERROR StatusLogger Could not reconfigure JMX >[junit4] 2> java.security.AccessControlException: access denied > ("javax.management.MBeanServerPermission" "createMBeanServer") >[junit4] 2> at > java.base/java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) >[junit4] 2> at > java.base/java.security.AccessController.checkPermission(AccessController.java:897) >[junit4] 2> at > java.base/java.lang.SecurityManager.checkPermission(SecurityManager.java:322) >[junit4] 2> at > java.management/java.lang.management.ManagementFactory.getPlatformMBeanServer(ManagementFactory.java:479) >[junit4] 2> at > org.apache.logging.log4j.core.jmx.Server.reregisterMBeansAfterReconfigure(Server.java:140) >[junit4] 2> at > org.apache.logging.log4j.core.LoggerContext.setConfiguration(LoggerContext.java:559) >[junit4] 2> at > org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:620) >[junit4] 2> at > org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:637) >[junit4] 2> at > org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:231) >[junit4] 2> at > org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:153) >[junit4] 2> at > org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:45) >[junit4] 2> at > org.apache.logging.log4j.LogManager.getContext(LogManager.java:194) >[junit4] 2> at > org.apache.logging.log4j.LogManager.getLogger(LogManager.java:581) >[junit4] 2> at > org.apache.lucene.luke.util.LoggerFactory.getLogger(LoggerFactory.java:70) >[junit4] 2> at > org.apache.lucene.luke.models.util.IndexUtils.<clinit>(IndexUtils.java:62) >[junit4] 2> at > org.apache.lucene.luke.models.LukeModel.<init>(LukeModel.java:60) >[junit4] 2> at > org.apache.lucene.luke.models.overview.OverviewImpl.<init>(OverviewImpl.java:50) >[junit4] 2> at 
> org.apache.lucene.luke.models.overview.OverviewImplTest.testIsOptimized(OverviewImplTest.java:77) >[junit4] 2> at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >[junit4] 2> at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) >[junit4] 2> at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >[junit4] 2> at > java.base/java.lang.reflect.Method.invoke(Method.java:566) >[junit4] 2> at > com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750) >[junit4] 2> at > com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938) >[junit4] 2> at > com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974) >[junit4] 2> at >
[jira] [Commented] (LUCENE-8998) OverviewImplTest.testIsOptimized reproducible failure
[ https://issues.apache.org/jira/browse/LUCENE-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16945009#comment-16945009 ] Tomoko Uchida commented on LUCENE-8998: --- I'm not sure this old feature is still meaningful with Lucene's current segment management strategy, but I attached a patch that uses NoMergePolicy to prevent segment merges when testing the {{isOptimized()}} method.
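For reference, the NoMergePolicy approach described in the comment above would look roughly like the following fragment. This is a hypothetical sketch, not the attached patch (LUCENE-8998.patch is the authoritative version): {{IndexWriterConfig.setMergePolicy}} and {{NoMergePolicy.INSTANCE}} are real Lucene APIs, while the surrounding test scaffolding ({{random()}}, {{dir}}) is assumed from Lucene's test framework.

```java
// Hypothetical sketch: disable background merges so the segment count a
// test observes stays deterministic regardless of the random seed.
IndexWriterConfig config = new IndexWriterConfig(new MockAnalyzer(random()));
config.setMergePolicy(NoMergePolicy.INSTANCE);  // segments stay exactly as written
IndexWriter writer = new IndexWriter(dir, config);
// ... add documents and commit; with merges disabled, isOptimized() can be
// asserted reliably (e.g. after an explicit forceMerge(1)).
```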
[jira] [Updated] (LUCENE-8998) OverviewImplTest.testIsOptimized reproducible failure
[ https://issues.apache.org/jira/browse/LUCENE-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-8998: -- Attachment: LUCENE-8998.patch Status: Open (was: Open)
[jira] [Resolved] (LUCENE-9000) Cannot resolve classes from org.apache.lucene.core plugin and others
[ https://issues.apache.org/jira/browse/LUCENE-9000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida resolved LUCENE-9000. --- Resolution: Not A Problem > Cannot resolve classes from org.apache.lucene.core plugin and others > > > Key: LUCENE-9000 > URL: https://issues.apache.org/jira/browse/LUCENE-9000 > Project: Lucene - Core > Issue Type: Bug > Components: core/queryparser >Affects Versions: 7.1 > Environment: DTP consumes the org.apache.lucene.core plugin; > trying to compile with the class got this error. > The compilation error for imports of: import > org.apache.lucene.queryParser.ParseException; import > org.apache.lucene.queryParser.QueryParser; import > org.apache.lucene.search.Searcher; > > > >Reporter: Rosa Casillas >Priority: Major > Fix For: 7.1 > > > we are consuming the *org.apache.lucene.core* plugin in our source code. > We updated the *org.apache.lucene.core* version from *2.9.0 to > 7.1.0* *(supported by Photon)*. > But doing that gives us compilation errors in the below statements, > > _import org.apache.lucene.queryParser.ParseException;_ > _import org.apache.lucene.queryParser.QueryParser;_ > _import org.apache.lucene.search.Searcher;_ > > Can you please let us know how to resolve these imports? > > > We took a look at the content and noticed that that class is not directly there. > We tried with the Classis class but even then were not able to resolve it. We don't > have this issue with the previous version (2.9.0) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9000) Cannot resolve classes from org.apache.lucene.core plugin and others
[ https://issues.apache.org/jira/browse/LUCENE-9000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16945017#comment-16945017 ] Tomoko Uchida commented on LUCENE-9000: --- As far as class imports go, these lines should work with 7.1.0. {code:java} import org.apache.lucene.queryparser.classic.ParseException; import org.apache.lucene.queryparser.classic.QueryParser; import org.apache.lucene.search.IndexSearcher; {code} Please refer to the 7.1.0 Javadocs and post questions to the mailing lists instead of opening issues, because this Jira isn't a help desk. [https://lucene.apache.org/core/7_1_0/] [https://lucene.apache.org/core/discussion.html]
[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950620#comment-16950620 ] Tomoko Uchida edited comment on LUCENE-9004 at 10/13/19 11:46 PM: -- I also really look forward to getting this feature into Lucene! FWIW, it seems that there are several versions of the HNSW paper, and I think this is the latest revision (v4). The pseudo-algorithm parts have been refined or evolved from the original version ([~sokolov] introduced here), though I've not yet closely checked the diffs and have no idea whether they have significant impact here. [https://arxiv.org/abs/1603.09320] (I will check out / play with the PoC branch and share my findings, if it would be useful.) was (Author: tomoko uchida): I also really look forward to get this feature into Lucene! FWIW, it seems like that there are several versions of the HSNW paper and I think this is the latest revision (v4). Pseudo algorithm parts have been refined or evolved from the original version ([~sokolov] introduced here) though I've not yet closely checked the diffs and have no idea about they have significant impacts here. https://arxiv.org/abs/1603.09320 (I will check out / play with the PoC branch and share my findings, if it would be useful.)
[jira] [Commented] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950620#comment-16950620 ] Tomoko Uchida commented on LUCENE-9004: --- I also really look forward to get this feature into Lucene! FWIW, it seems like that there are several versions of the HSNW paper and I think this is the latest revision (v4). Pseudo algorithm parts have been refined or evolved from the original version ([~sokolov] introduced here) though I've not yet closely checked the diffs and have no idea about they have significant impacts here. https://arxiv.org/abs/1603.09320 (I will check out / play with the PoC branch and share my findings, if it would be useful.) > Approximate nearest vector search > - > > Key: LUCENE-9004 > URL: https://issues.apache.org/jira/browse/LUCENE-9004 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Michael Sokolov >Priority: Major > > "Semantic" search based on machine-learned vector "embeddings" representing > terms, queries and documents is becoming a must-have feature for a modern > search engine. SOLR-12890 is exploring various approaches to this, including > providing vector-based scoring functions. This is a spinoff issue from that. > The idea here is to explore approximate nearest-neighbor search. Researchers > have found an approach based on navigating a graph that partially encodes the > nearest neighbor relation at multiple scales can provide accuracy > 95% (as > compared to exact nearest neighbor calculations) at a reasonable cost. This > issue will explore implementing HNSW (hierarchical navigable small-world) > graphs for the purpose of approximate nearest vector search (often referred > to as KNN or k-nearest-neighbor search). > At a high level the way this algorithm works is this. First assume you have a > graph that has a partial encoding of the nearest neighbor relation, with some > short and some long-distance links. 
If this graph is built in the right way > (has the hierarchical navigable small world property), then you can > efficiently traverse it to find nearest neighbors (approximately) in log N > time where N is the number of nodes in the graph. I believe this idea was > pioneered in [1]. The great insight in that paper is that if you use the > graph search algorithm to find the K nearest neighbors of a new document > while indexing, and then link those neighbors (undirectedly, i.e. both ways) to > the new document, then the graph that emerges will have the desired > properties. > The implementation I propose for Lucene is as follows. We need two new data > structures to encode the vectors and the graph. We can encode vectors using a > light wrapper around {{BinaryDocValues}} (we also want to encode the vector > dimension and have efficient conversion from bytes to floats). For the graph > we can use {{SortedNumericDocValues}} where the values we encode are the > docids of the related documents. Encoding the interdocument relations using > docids directly will make it relatively fast to traverse the graph since we > won't need to look up through an id-field indirection. This choice limits us > to building a graph-per-segment since it would be impractical to maintain a > global graph for the whole index in the face of segment merges. However > graph-per-segment is a very natural fit at search time - we can traverse each > segment's graph independently and merge results as we do today for term-based > search. > At index time, however, merging graphs is somewhat challenging. While > indexing we build a graph incrementally, performing searches to construct > links among neighbors. When merging segments we must construct a new graph > containing elements of all the merged segments. Ideally we would somehow > preserve the work done when building the initial graphs, but at least as a > start I'd propose we construct a new graph from scratch when merging. 
The > process is going to be limited, at least initially, to graphs that can fit > in RAM since we require random access to the entire graph while constructing > it: In order to add links bidirectionally we must continually update existing > documents. > I think we want to express this API to users as a single joint > {{KnnGraphField}} abstraction that joins together the vectors and the graph > as a single joint field type. Mostly it just looks like a vector-valued > field, but has this graph attached to it. > I'll push a branch with my POC and would love to hear comments. It has many > nocommits, the basic design is not really set, there is no Query implementation > and no integration with IndexSearcher, but it does work by some measure using > a standalone test class. I've tested with uniform random vectors and on my > laptop indexed 10K documents in around 10
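The greedy graph traversal sketched in the description can be illustrated with a small toy (illustrative only, not Lucene code: the class name, the 1-D float "vectors", and the hand-built neighbor graph are all invented for the example). Starting from an entry node, we repeatedly hop to whichever neighbor is closer to the query and stop at a local minimum, which serves as the approximate nearest neighbor:

```java
import java.util.HashMap;
import java.util.Map;

// Toy greedy search over a neighbor graph that partially encodes the
// nearest-neighbor relation. Node ids stand in for docids; real HNSW
// adds multiple layers and a beam of candidates instead of a single walk.
public class GreedyGraphSearch {

    static float dist(float a, float b) {
        return Math.abs(a - b);
    }

    // graph: node id -> neighbor ids (linked both ways, as in the issue text)
    static int nearest(Map<Integer, int[]> graph, float[] vectors, float query, int entry) {
        int current = entry;
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int nb : graph.get(current)) {
                if (dist(vectors[nb], query) < dist(vectors[current], query)) {
                    current = nb;   // hop to the closer neighbor
                    improved = true;
                }
            }
        }
        return current;  // local minimum = approximate nearest neighbor
    }

    public static void main(String[] args) {
        float[] vectors = {0.0f, 2.0f, 4.0f, 6.0f, 8.0f};
        Map<Integer, int[]> graph = new HashMap<>();
        graph.put(0, new int[]{1, 2});       // a mix of short and longer links
        graph.put(1, new int[]{0, 2, 3});
        graph.put(2, new int[]{0, 1, 3, 4});
        graph.put(3, new int[]{1, 2, 4});
        graph.put(4, new int[]{2, 3});
        System.out.println(nearest(graph, vectors, 7.1f, 0)); // prints 4 (vector 8.0)
    }
}
```

The actual algorithm keeps a priority queue of candidates (a beam) rather than a single current node, which is what pushes recall toward the >95% figure quoted in the description.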
[jira] [Comment Edited] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034121#comment-17034121 ] Tomoko Uchida edited comment on LUCENE-9201 at 2/11/20 4:43 AM: One small thing about the equivalent of "ant documentation" (the gradle built-in Javadoc task or our customized one): I think it'd be better if the javadoc generation task output all javadocs to a module-wide common directory (e.g., {{lucene/build/docs}} or {{solr/build/docs}}), just like the ant build does, instead of each module's build directory. This makes things easier for the succeeding "broken links check" (running {{checkJavadocLinks.py}} - or its replacement?) and for release managers' work, which includes updating the official documentation site ([https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo#ReleaseTodo-Pushdocs,changesandjavadocstotheCMSproductiontree]). was (Author: tomoko uchida): One small thing about the equivalent "ant documentation" (gradle built-in Javadoc task or our customized one), I think it'd be better the javadoc generation task should output all javadocs to module-wide common directory (e.g., {{lucene/build/docs}} or {{solr/build/docs}}) just ant build does, instead of each module's build directory. This makes things easy for succeeding "broken links check" (running {{checkJavadocLinks.py}} - or its replacement?) and release managers work that should includes updating the official documentation site ([https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo#ReleaseTodo-Pushdocs,changesandjavadocstotheCMSproductiontree]). 
> Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png > > Time Spent: 40m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-9201: -- Attachment: LUCENE-9201-missing-docs.patch > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: LUCENE-9201-ecj-2.patch, LUCENE-9201-ecj.patch, > LUCENE-9201-missing-docs.patch, LUCENE-9201.patch, javadocGRADLE.png, > javadocHTML4.png, javadocHTML5.png > > Time Spent: 4h 10m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042976#comment-17042976 ] Tomoko Uchida commented on LUCENE-9201: --- I attached [^LUCENE-9201-missing-docs.patch], where the only difference from [^LUCENE-9201.patch] is this line. {code:java} < +dirs.each { dir -> --- > +dirs.findAll { it.exists() }.each { dir -> {code} (I'm not so familiar with groovy, but collect() seems to be not a filter but a mapper or iterator. [http://docs.groovy-lang.org/next/html/documentation/working-with-collections.html#_iterating_on_a_list]) {quote}The task as it is now doesn't really pass for me (python script fails for certain subprojects)? This is expected, right? {quote} Yes, this is expected. Some projects (lucene/backward-codecs, lucene/queryparser, etc.) have legacy package.html files in their source but the Gradle Javadoc task ignores them, so the generated Javadocs lack package summaries. The python linter detects that and correctly fails for now. This is the reason why I disabled this task for precommit: [https://github.com/apache/lucene-solr/pull/1267/files#diff-5a33a39a6ec8b4facbd4db4cdfb4131a]. We have to fix the Javadoc task (another issue may be needed?) to make the linter happy. I think it is fine that we have the "missing docs check" task on master soon, to track how the javadoc task is broken or fixed. I'd like to commit it to master tomorrow if there is no disapproval, since it's a bit late in JST... 
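For readers less familiar with Groovy, the collect()-vs-findAll() distinction drawn in the comment above maps directly onto the Java stream API (an illustrative analogy, not code from the patch; the directory names are invented): collect() transforms every element like map(), while findAll() keeps only matching elements like filter().

```java
import java.util.List;
import java.util.stream.Collectors;

// Groovy's collect() ~ Stream.map(); Groovy's findAll() ~ Stream.filter().
public class CollectVsFindAll {
    public static void main(String[] args) {
        List<String> dirs = List.of("core", "backward-codecs", "queryparser");

        // collect()-style: maps each element to something else
        List<Integer> lengths = dirs.stream()
                .map(String::length)
                .collect(Collectors.toList());
        System.out.println(lengths);    // prints [4, 15, 11]

        // findAll()-style: filters, keeping only matching elements
        List<String> bDirs = dirs.stream()
                .filter(d -> d.startsWith("b"))
                .collect(Collectors.toList());
        System.out.println(bDirs);      // prints [backward-codecs]
    }
}
```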
> Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: LUCENE-9201-ecj-2.patch, LUCENE-9201-ecj.patch, > LUCENE-9201-missing-docs.patch, LUCENE-9201.patch, javadocGRADLE.png, > javadocHTML4.png, javadocHTML5.png > > Time Spent: 4h 10m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9242) Gradle Javadoc task should output the same documents as Ant
[ https://issues.apache.org/jira/browse/LUCENE-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-9242: -- Description: "javadoc" task for the Gradle build does not correctly output package summaries, since it ignores "package.html" file in the source tree (so the Python linter {{checkJavaDocs.py}} detects that and fails for now.) Also the "javadoc" task should make inter-module links just as Ant build does. See for more details: LUCENE-9201 was: "javadoc" task for the Gradle build does not correctly output package summaries, since it ignores "package.html" file in the source tree (so the Python linter {{checkJavaDocs.py}} detects that and fails for now.) See for more details: LUCENE-9201 > Gradle Javadoc task should output the same documents as Ant > --- > > Key: LUCENE-9242 > URL: https://issues.apache.org/jira/browse/LUCENE-9242 > Project: Lucene - Core > Issue Type: Sub-task > Components: general/javadocs >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Priority: Major > > "javadoc" task for the Gradle build does not correctly output package > summaries, since it ignores "package.html" file in the source tree (so the > Python linter {{checkJavaDocs.py}} detects that and fails for now.) > Also the "javadoc" task should make inter-module links just as Ant build does. > See for more details: LUCENE-9201 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9242) Gradle Javadoc task should output the same documents as Ant
[ https://issues.apache.org/jira/browse/LUCENE-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-9242: -- Summary: Gradle Javadoc task should output the same documents as Ant (was: Gradle Javadoc task does not include package summaries) > Gradle Javadoc task should output the same documents as Ant > --- > > Key: LUCENE-9242 > URL: https://issues.apache.org/jira/browse/LUCENE-9242 > Project: Lucene - Core > Issue Type: Sub-task > Components: general/javadocs >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Priority: Major > > "javadoc" task for the Gradle build does not correctly output package > summaries, since it ignores "package.html" file in the source tree (so the > Python linter {{checkJavaDocs.py}} detects that and fails for now.) > See for more details: LUCENE-9201 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044631#comment-17044631 ] Tomoko Uchida commented on LUCENE-9201: --- I finally worked out how the "invoke-module-javadoc" macro resolves inter-module links (which isn't yet covered by the current gradle build). I changed the subject for LUCENE-9242 to "Gradle Javadoc task should output the same documents as Ant". > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: LUCENE-9201-ecj-2.patch, LUCENE-9201-ecj.patch, > LUCENE-9201-missing-docs.patch, LUCENE-9201.patch, javadocGRADLE.png, > javadocHTML4.png, javadocHTML5.png > > Time Spent: 4.5h > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (SOLR-14295) Add the parameter descriptionn about "discardCompoundToken" for JapaneseTokenizer
Tomoko Uchida created SOLR-14295: Summary: Add the parameter descriptionn about "discardCompoundToken" for JapaneseTokenizer Key: SOLR-14295 URL: https://issues.apache.org/jira/browse/SOLR-14295 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Components: documentation Reporter: Tomoko Uchida Assignee: Tomoko Uchida In [LUCENE-9123], a parameter {{discardCompoundToken}} was added to JapaneseTokenizer(Factory). The ref-guide needs to be updated to let Solr users know this change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-11746) numeric fields need better error handling for prefix/wildcard syntax -- consider uniform support for "foo:* == foo:[* TO *]"
[ https://issues.apache.org/jira/browse/SOLR-11746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048257#comment-17048257 ] Tomoko Uchida commented on SOLR-11746: -- It seems like the Ref Guide build is now failing due to the changes here. {code:java} solr-ref-guide $ ant build-site ... build-site: [java] Relative link points at id that doesn't exist in dest: #differences-between-lucenes-classic-query-parser-and-solrs-standard-query-parser [java] ... source: file:/mnt/hdd/repo/lucene-solr/solr/build/solr-ref-guide/html-site/the-standard-query-parser.html [java] Relative link points at id that doesn't exist in dest: the-standard-query-parser.html#differences-between-lucenes-classic-query-parser-and-solrs-standard-query-parser [java] ... source: file:/mnt/hdd/repo/lucene-solr/solr/build/solr-ref-guide/html-site/common-query-parameters.html [java] Processed 2611 links (1932 relative) to 3477 anchors in 262 files [java] Total of 2 problems found BUILD FAILED /mnt/hdd/repo/lucene-solr/solr/solr-ref-guide/build.xml:251: Java returned: 255 {code} The build works for me when I removed those two lines. {code:java} --- a/solr/solr-ref-guide/src/common-query-parameters.adoc +++ b/solr/solr-ref-guide/src/common-query-parameters.adoc @@ -102,7 +102,7 @@ fq=+popularity:[10 TO *] +section:0 * The document sets from each filter query are cached independently. Thus, concerning the previous examples: use a single `fq` containing two mandatory clauses if those clauses appear together often, and use two separate `fq` parameters if they are relatively independent. (To learn about tuning cache sizes and making sure a filter cache actually exists, see <>.) -* It is also possible to use <> inside the `fq` to cache clauses individually and - among other things - to achieve union of cached filter queries. +// * It is also possible to use <> inside the `fq` to cache clauses individually and - among other things - to achieve union of cached filter queries. 
diff --git a/solr/solr-ref-guide/src/the-standard-query-parser.adoc b/solr/solr-ref-guide/src/the-standard-query-parser.adoc index c572e503e5b..3a3cd7f958d 100644 --- a/solr/solr-ref-guide/src/the-standard-query-parser.adoc +++ b/solr/solr-ref-guide/src/the-standard-query-parser.adoc @@ -174,7 +174,7 @@ The brackets around a query determine its inclusiveness. * You can mix these types so one end of the range is inclusive and the other is exclusive. Here's an example: `count:{1 TO 10]` Wildcards, `*`, can also be used for either or both endpoints to specify an open-ended range query. -This is a <<#differences-between-lucenes-classic-query-parser-and-solrs-standard-query-parser,divergence from Lucene's Classic Query Parser>>. +// This is a <<#differences-between-lucenes-classic-query-parser-and-solrs-standard-query-parser,divergence from Lucene's Classic Query Parser>>. {code} I know nothing about this issue, just noticed the broken links when I updated the ref-guide on another issue... > numeric fields need better error handling for prefix/wildcard syntax -- > consider uniform support for "foo:* == foo:[* TO *]" > > > Key: SOLR-11746 > URL: https://issues.apache.org/jira/browse/SOLR-11746 > Project: Solr > Issue Type: Bug >Affects Versions: 7.0 >Reporter: Chris M. Hostetter >Assignee: Houston Putman >Priority: Major > Fix For: master (9.0), 8.5 > > Attachments: SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, > SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, > SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch > > > On the solr-user mailing list, Torsten Krah pointed out that with Trie > numeric fields, query syntax such as {{foo_d:\*}} has been functionality > equivilent to {{foo_d:\[\* TO \*]}} and asked why this was not also supported > for Point based numeric fields. 
> The fact that this type of syntax works (for {{indexed="true"}} Trie fields) > appears to have been an (untested, undocumented) fluke of Trie fields given > that they use indexed terms for the (encoded) numeric terms and inherit the > default implementation of {{FieldType.getPrefixQuery}} which produces a > prefix query against the {{""}} (empty string) term. > (Note that this syntax has apparently _*never*_ worked for Trie fields with > {{indexed="false" docValues="true"}} ) > In general, we should assess the behavior when users attempt a prefix/wildcard > syntax query against numeric fields, as currently the behavior is largely > nonsensical: prefix/wildcard syntax frequently matches no docs w/o any sort > of error, and the aforementioned {{numeric_field:*}} behaves inconsistently
[jira] [Updated] (SOLR-14295) Add the parameter description about "discardCompoundToken" for JapaneseTokenizer
[ https://issues.apache.org/jira/browse/SOLR-14295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated SOLR-14295: - Summary: Add the parameter description about "discardCompoundToken" for JapaneseTokenizer (was: Add the parameter descriptionn about "discardCompoundToken" for JapaneseTokenizer) > Add the parameter description about "discardCompoundToken" for > JapaneseTokenizer > > > Key: SOLR-14295 > URL: https://issues.apache.org/jira/browse/SOLR-14295 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: documentation >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Minor > Attachments: SOLR-14295.patch > > > In [LUCENE-9123], a parameter {{discardCompoundToken}} was added to > JapaneseTokenizer(Factory). > The ref-guide needs to be updated to let Solr users know this change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-14295) Add the parameter description about "discardCompoundToken" for JapaneseTokenizer
[ https://issues.apache.org/jira/browse/SOLR-14295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated SOLR-14295: - Description: In [LUCENE-9123], a parameter {{discardCompoundToken}} was added to JapaneseTokenizer(Factory). The ref-guide needs to be updated to let Solr users know about this change. was: In [LUCENE-9123], a parameter {{discardCompoundToken}} was added to JapaneseTokenizer(Factory). The ref-guide needs to be updated to let Solr users know this change. > Add the parameter description about "discardCompoundToken" for > JapaneseTokenizer > > > Key: SOLR-14295 > URL: https://issues.apache.org/jira/browse/SOLR-14295 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: documentation >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Minor > Attachments: SOLR-14295.patch > > > In [LUCENE-9123], a parameter {{discardCompoundToken}} was added to > JapaneseTokenizer(Factory). > The ref-guide needs to be updated to let Solr users know about this change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (SOLR-14295) Add the parameter description about "discardCompoundToken" for JapaneseTokenizer
[ https://issues.apache.org/jira/browse/SOLR-14295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida resolved SOLR-14295. -- Fix Version/s: 8.5 master (9.0) Resolution: Fixed > Add the parameter description about "discardCompoundToken" for > JapaneseTokenizer > > > Key: SOLR-14295 > URL: https://issues.apache.org/jira/browse/SOLR-14295 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: documentation >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Minor > Fix For: master (9.0), 8.5 > > Attachments: SOLR-14295.patch > > > In [LUCENE-9123], a parameter {{discardCompoundToken}} was added to > JapaneseTokenizer(Factory). > The ref-guide needs to be updated to let Solr users know about this change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-14295) Add the parameter descriptionn about "discardCompoundToken" for JapaneseTokenizer
[ https://issues.apache.org/jira/browse/SOLR-14295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated SOLR-14295: - Attachment: SOLR-14295.patch > Add the parameter descriptionn about "discardCompoundToken" for > JapaneseTokenizer > - > > Key: SOLR-14295 > URL: https://issues.apache.org/jira/browse/SOLR-14295 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: documentation >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Minor > Attachments: SOLR-14295.patch > > > In [LUCENE-9123], a parameter {{discardCompoundToken}} was added to > JapaneseTokenizer(Factory). > The ref-guide needs to be updated to let Solr users know this change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-9242) Gradle Javadoc task should output the same documents as Ant
[ https://issues.apache.org/jira/browse/LUCENE-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048696#comment-17048696 ] Tomoko Uchida edited comment on LUCENE-9242 at 3/1/20 10:37 PM: I opened a draft PR [https://github.com/apache/lucene-solr/pull/1304]. This adds a gradle task, named {{invokeJavadoc}}, which generates Javadocs with inter-module hyperlinks by invoking the Ant javadoc task. Also this passes the {{checkMissingJavadocs}} check. The task can be called as below: {code:java} # generate javadocs for each project $ ./gradlew :lucene:core:invokeJavadoc {code} or, {code:java} # generate javadocs for all projects at once $ ./gradlew invokeJavadoc {code} The work isn't completed yet, but the most important parts are already ported. Quick replies to comments on LUCENE-9201 follow: {quote}It is my personal preference to have a project-scope granularity. This way you can run project-scoped task (like gradlew -p lucene/core javadoc). My personal take on assembling "distributions" is to have a separate project that just takes what it needs from other projects and puts it together (with any tweaks required). This makes it easier to reason about how a distribution is assembled and from where, while each project just takes care of itself. {quote} I'd love this approach; however, while trying it I noticed that it looks difficult to properly generate inter-module hyperlinks without affecting the existing javadoc's path hierarchy (already published on the apache.org web site), if we want to place generated javadocs under ${sub_project_root}/build/docs/javadoc (gradle's default javadoc destination). The fundamental problem here, I think, is that in order to make hyperlinks from a module A to another module B, we need to know the effective relative path from module A to module B and pass it to the Javadoc Tool. I aggregated all javadocs into {{lucene/build/docs}} or {{solr/build/docs}}, just as the Ant build does, to resolve the relative paths. 
I might be missing something - please let me know if my understanding isn't correct. {quote}for "directly call the javadoc tool" we may want to use the ant task as a start. This ant task is doing quite a bit of work above and beyond what the tool is doing (if you look at the relevant code to ant, you may be shocked!). {quote} As the first step I tried to reproduce the principal Ant macros: "invoke-javadoc" (in lucene/common-build.xml) and "invoke-module-javadoc" (in lucene/module-build.xml) on the gradle build. By doing so, there are now no missing package summaries, and inter-module links are generated. (The current setup to resolve the hyperlinks looks quite redundant; I think we can do it in more sophisticated ways.) {quote}A custom javadoc invocation is certainly possible and could possibly make things easier in the long run. {quote} {quote}as a second step you can look at computing package list for a module yourself (it may allow invoking the tool directly). {quote} Yes, we will probably be able to throw away all ant tasks and rely only on pure gradle code. Some extra effort will be needed to faithfully transfer the elaborate ant setups into corresponding gradle scripts... {quote}You'd need to declare inputs/ outputs properly though so that it is skippable. Those javadoc invocations take a long time in precommit. {quote} I passed inputs/outputs to the task so as not to needlessly repeat the javadoc invocation. It seems to work - {{ant.javadoc}} is called only when the java source or output directory is changed. was (Author: tomoko uchida): I opened a draft PR [https://github.com/apache/lucene-solr/pull/1304]. This adds a gradle task, named {{invokeJavadoc}}, which generates Javadocs with inter-module hyperlinks by invoking Ant javadoc task. Also this passes {{checkMissingJavadocs}} check. 
The task can be called as below: {code:java} # generate javadocs for each project $ ./gradlew :lucene:core:invokeJavadoc {code} or, {code:java} # generate javadocs for all projects at once $ ./gradlew invokeJavadoc {code} The work isn't completed yet, but the most important parts are already ported. Quick replies to comments on LUCENE-9201 will be following: {quote}It is my personal preference to have a project-scope granularity. This way you can run project-scoped task (like gradlew -p lucene/core javadoc). My personal take on assembling "distributions" is to have a separate project that just takes what it needs from other projects and puts it together (with any tweaks required). This makes it easier to reason about how a distribution is assembled and from where, while each project just takes care of itself. {quote} I'd love this approach, however, when I was trying I noticed that it looks difficult to properly generate inter-module hyperlinks without affecting the existing javadoc's path hierarchy (already published on the apache.org web
[jira] [Commented] (LUCENE-9242) Gradle Javadoc task should output the same documents as Ant
[ https://issues.apache.org/jira/browse/LUCENE-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048696#comment-17048696 ] Tomoko Uchida commented on LUCENE-9242: --- I opened a draft PR [https://github.com/apache/lucene-solr/pull/1304]. This adds a gradle task named {{invokeJavadoc}}, which generates Javadocs with inter-module hyperlinks by invoking the Ant javadoc task. It also passes the {{checkMissingJavadocs}} check. The task can be invoked as follows: {code:java} # generate javadocs for a single project $ ./gradlew :lucene:core:invokeJavadoc {code} or, {code:java} # generate javadocs for all projects at once $ ./gradlew invokeJavadoc {code} The work isn't complete yet, but the most important parts are already ported. Quick replies to the comments on LUCENE-9201 follow: {quote}It is my personal preference to have a project-scope granularity. This way you can run project-scoped task (like gradlew -p lucene/core javadoc). My personal take on assembling "distributions" is to have a separate project that just takes what it needs from other projects and puts it together (with any tweaks required). This makes it easier to reason about how a distribution is assembled and from where, while each project just takes care of itself. {quote} I'd love this approach; however, while trying it I noticed that it seems difficult to properly generate inter-module hyperlinks without affecting the existing javadocs' path hierarchy (already published on the apache.org web site), if we want to place the generated javadocs under ${sub_project_root}/build/docs/javadoc (gradle's default javadoc destination). The fundamental problem, I think, is that in order to make hyperlinks from a module A to another module B, we need to know the effective relative path from module A to module B and pass it to the Javadoc Tool. I aggregated all javadocs into {{lucene/build/docs}} or {{solr/build/docs}}, just as the Ant build does, to resolve the relative paths. 
I might be missing something - please let me know if my understanding isn't correct. {quote}for "directly call the javadoc tool" we may want to use the ant task as a start. This ant task is doing quite a bit of work above and beyond what the tool is doing (if you look at the relevant code to ant, you may be shocked!). {quote} As a first step I tried to reproduce the principal Ant macros "invoke-javadoc" (in lucene/common-build.xml) and "invoke-module-javadoc" (in lucene/module-build.xml) in the gradle build. By doing so, there are now no missing package summaries, and inter-module links are generated. (The current setup for resolving the hyperlinks looks quite redundant; I think we can do it in more sophisticated ways.) {quote}A custom javadoc invocation is certainly possible and could possibly make things easier in the long run. {quote} {quote}as a second step you can look at computing package list for a module yourself (it may allow invoking the tool directly). {quote} Yes, we will probably be able to throw away all ant tasks and rely only on pure gradle code. Some extra effort will be needed to faithfully transfer the elaborate ant setups into corresponding gradle scripts... {quote}You'd need to declare inputs/ outputs properly though so that it is skippable. Those javadoc invocations take a long time in precommit. {quote} I passed inputs/outputs to the task so that the javadoc invocation is not needlessly repeated. It seems to work - {{ant.javadoc}} is called only when the java sources or the output directory change. 
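The relative-path problem described above can be sketched with plain {{java.nio.file}} calls. This is a hypothetical illustration, not code from the PR: once all module javadocs are aggregated under one docs root (here an assumed "lucene/build/docs" layout), the inter-module link to pass to the Javadoc Tool is simply the relativized path between the two module directories.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class RelativeJavadocLink {
    // Compute the relative link from one module's javadoc root to another's,
    // assuming both are aggregated under a common docs root (as the Ant build does).
    // The module names and the "lucene/build/docs" root are illustrative only.
    static String relativeLink(String fromModule, String toModule) {
        Path docs = Paths.get("lucene", "build", "docs");
        Path from = docs.resolve(fromModule);
        Path to = docs.resolve(toModule);
        // Path#relativize is a purely syntactic operation, so the docs
        // directories need not exist yet when the link is computed.
        return from.relativize(to).toString().replace('\\', '/');
    }

    public static void main(String[] args) {
        // e.g. a link from analysis/common to core climbs two levels up
        System.out.println(relativeLink("analysis/common", "core"));   // ../../core
        System.out.println(relativeLink("core", "analysis/common"));   // ../analysis/common
    }
}
```

This is why the destination layout matters: the relativized path is only correct if the aggregated tree matches the published folder structure.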
> Gradle Javadoc task should output the same documents as Ant > --- > > Key: LUCENE-9242 > URL: https://issues.apache.org/jira/browse/LUCENE-9242 > Project: Lucene - Core > Issue Type: Sub-task > Components: general/javadocs >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > "javadoc" task for the Gradle build does not correctly output package > summaries, since it ignores "package.html" file in the source tree (so the > Python linter {{checkJavaDocs.py}} detects that and fails for now.) > Also the "javadoc" task should make inter-module links just as Ant build does. > See for more details: LUCENE-9201 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9242) Gradle Javadoc task should output the same documents as Ant
[ https://issues.apache.org/jira/browse/LUCENE-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048702#comment-17048702 ] Tomoko Uchida commented on LUCENE-9242: --- [~dweiss] [~rcmuir] Could you take a look at the PR? It seems to work for me but I am not sure if this is a good start or not, any thoughts or brief comments are welcomed. > Gradle Javadoc task should output the same documents as Ant > --- > > Key: LUCENE-9242 > URL: https://issues.apache.org/jira/browse/LUCENE-9242 > Project: Lucene - Core > Issue Type: Sub-task > Components: general/javadocs >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > "javadoc" task for the Gradle build does not correctly output package > summaries, since it ignores "package.html" file in the source tree (so the > Python linter {{checkJavaDocs.py}} detects that and fails for now.) > Also the "javadoc" task should make inter-module links just as Ant build does. > See for more details: LUCENE-9201 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020273#comment-17020273 ] Tomoko Uchida commented on LUCENE-9123: --- Thank you [~cm] and [~johtani] for your comments. Since fixing the synonym filter would not be easy and may take time, I think the patch is a good quick fix for the majority of users. And yes, we could recommend the UniDic dictionary instead of search mode once we resolve https://issues.apache.org/jira/browse/LUCENE-4056 and https://issues.apache.org/jira/browse/LUCENE-8816. About n-best mode, I am not sure we should consider it here. "Emitting n-best tokens" and "emitting compound tokens in search mode" are different concepts, though both emit multiple tokens at the same position. As far as I have read the code, they are orthogonal. (Maybe we should open another issue for n-best, but it seems difficult to find a solution for that without fixing the synonym filter.) > JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter > --- > > Key: LUCENE-9123 > URL: https://issues.apache.org/jira/browse/LUCENE-9123 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 8.4 >Reporter: Kazuaki Hiraga >Assignee: Tomoko Uchida >Priority: Major > Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch > > > JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with > both of SynonymGraphFilter and SynonymFilter when JT generates multiple > tokens as an output. If we use `mode=normal`, it should be fine. However, we > would like to use decomposed tokens that can maximize the chance to increase > recall. 
> Snippet of schema: > {code:xml} > positionIncrementGap="100" autoGeneratePhraseQueries="false"> > > > synonyms="lang/synonyms_ja.txt" > tokenizerFactory="solr.JapaneseTokenizerFactory"/> > > > tags="lang/stoptags_ja.txt" /> > > > > > > minimumLength="4"/> > > > > > {code} > An synonym entry that generates error: > {noformat} > 株式会社,コーポレーション > {noformat} > The following is an output on console: > {noformat} > $ ./bin/solr create_core -c jp_test -d ../config/solrconfs > ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] > Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 > (got: 0) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17050924#comment-17050924 ] Tomoko Uchida commented on LUCENE-9136: --- [~jtibshirani] [~irvingzhang] thanks for your hard work here! {quote}I was thinking we could actually reuse the existing `PostingsFormat` and `DocValuesFormat` implementations. {quote} Actually, the first implementation (by Michael Sokolov) for HNSW wrapped DocValuesFormat to avoid code duplication. However, this approach - reusing existing code - could raise another concern from the perspective of maintenance. (From the beginning, Adrien Grand suggested a dedicated format instead of hacking doc values.) This is the main reason why I introduced a new format for knn search in LUCENE-9004. I'm not strongly against the "reuse existing formats" strategy if it's the best way here; I just would like to share my feeling that it could be a bit controversial, and you might need to convince maintainers that this (pretty new) feature does not cause any problems/concerns for future maintenance of Lucene core if you implement it on top of existing formats/readers. I have not closely looked at your PR yet - sorry if my comments are completely beside the point (you might already have talked with other committers about the implementation in another channel, e.g. private chats?). > Introduce IVFFlat to Lucene for ANN similarity search > - > > Key: LUCENE-9136 > URL: https://issues.apache.org/jira/browse/LUCENE-9136 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Xin-Chun Zhang >Priority: Major > Attachments: 1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png, > image-2020-02-16-15-05-02-451.png > > > Representation learning (RL) has been an established discipline in the > machine learning space for decades but it draws tremendous attention lately > with the emergence of deep learning. The central problem of RL is to > determine an optimal representation of the input data. 
By embedding the data > into a high dimensional vector, the vector retrieval (VR) method is then > applied to search the relevant items. > With the rapid development of RL over the past few years, the technique has > been used extensively in industry, from online advertising to computer vision > and speech recognition. There exist many open source implementations of VR > algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various > choices for potential users. However, the aforementioned implementations are > all written in C++, with no plans to support a Java interface, making them hard > to integrate into Java projects or to use for those who are not familiar with C/C++ > [[https://github.com/facebookresearch/faiss/issues/105]]. > The algorithms for vector retrieval can be roughly classified into four > categories: > # Tree-based algorithms, such as KD-tree; > # Hashing methods, such as LSH (Locality-Sensitive Hashing); > # Product quantization based algorithms, such as IVFFlat; > # Graph-based algorithms, such as HNSW, SSG, NSG; > where IVFFlat and HNSW are the most popular ones among all the VR algorithms. > IVFFlat is better for high-precision applications such as face recognition, > while HNSW performs better in general scenarios including recommendation and > personalized advertisement. *The recall ratio of IVFFlat can be gradually > increased by adjusting the query parameter (nprobe), while it's hard for HNSW > to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. > Recently, the implementation of HNSW (Hierarchical Navigable Small World, > LUCENE-9004) for Lucene has made great progress. The issue draws the attention > of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. > As an alternative for solving ANN similarity search problems, IVFFlat is also > very popular with many users and supporters. 
Compared with HNSW, IVFFlat has > a smaller index size but requires k-means clustering, while HNSW is faster in > query (no training required) but requires extra storage for saving graphs > [indexing 1M > vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]]. > Another advantage is that IVFFlat can be faster and more accurate when > GPU parallel computing is enabled (currently not supported in Java). Both algorithms > have their merits and demerits. Since HNSW is now under development, it may > be better to provide both implementations (HNSW && IVFFlat) for potential > users who face very different scenarios and want more choices. > The latest branch is > [*lucene-9136-ann-ivfflat*]([https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat)|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat] --
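For readers unfamiliar with the algorithm under discussion, here is a toy sketch of the IVFFlat idea. The class and method names are illustrative only, not Lucene's or the PR's API: vectors are bucketed by their nearest centroid at index time, and a query exhaustively scans only the {{nprobe}} closest buckets, which is why raising nprobe gradually raises recall.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Minimal IVFFlat sketch: centroids would normally come from k-means training;
// here they are supplied directly for brevity.
public class IvfFlatSketch {
    final float[][] centroids;
    final List<List<float[]>> lists;   // one inverted list per centroid

    IvfFlatSketch(float[][] centroids) {
        this.centroids = centroids;
        this.lists = new ArrayList<>();
        for (int i = 0; i < centroids.length; i++) lists.add(new ArrayList<>());
    }

    // Squared Euclidean distance (monotonic, so no sqrt needed for ranking).
    static float dist(float[] a, float[] b) {
        float s = 0;
        for (int i = 0; i < a.length; i++) { float d = a[i] - b[i]; s += d * d; }
        return s;
    }

    int nearestCentroid(float[] v) {
        int best = 0;
        for (int i = 1; i < centroids.length; i++)
            if (dist(v, centroids[i]) < dist(v, centroids[best])) best = i;
        return best;
    }

    // Index time: each vector is stored in its nearest centroid's bucket.
    void add(float[] v) { lists.get(nearestCentroid(v)).add(v); }

    // Query time: scan only the nprobe closest buckets; larger nprobe -> higher recall.
    float[] search(float[] q, int nprobe) {
        Integer[] order = new Integer[centroids.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> dist(q, centroids[i])));
        float[] best = null;
        for (int p = 0; p < Math.min(nprobe, order.length); p++)
            for (float[] v : lists.get(order[p]))
                if (best == null || dist(q, v) < dist(q, best)) best = v;
        return best;
    }

    public static void main(String[] args) {
        IvfFlatSketch index = new IvfFlatSketch(new float[][] {{0, 0}, {10, 10}});
        index.add(new float[] {1, 1});
        index.add(new float[] {9, 9});
        // (8,8) falls in the second bucket, so nprobe=1 already finds (9,9)
        System.out.println(Arrays.toString(index.search(new float[] {8, 8}, 1)));
    }
}
```

With nprobe equal to the number of centroids the search degenerates to an exhaustive (100% recall) scan, matching the claim in the description above.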
[jira] [Commented] (LUCENE-9259) NGramFilter use wrong argument name for preserve option
[ https://issues.apache.org/jira/browse/LUCENE-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051311#comment-17051311 ] Tomoko Uchida commented on LUCENE-9259: --- The added tests and ref-guide examples also look fine to me. I'm planning to commit the patch to master and branch_8x shortly. [~Paul Pazderski] May I ask for your email address, for crediting? > NGramFilter use wrong argument name for preserve option > --- > > Key: LUCENE-9259 > URL: https://issues.apache.org/jira/browse/LUCENE-9259 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 7.4, 8.0 >Reporter: Paul Pazderski >Priority: Minor > Attachments: LUCENE-9259.patch > > > LUCENE-7960 added the possibility to preserve the original term when using > NGram filters. The documentation says to enable it with 'preserveOriginal' > and it works for EdgeNGram filter. But NGram filter requires the initially > planned option 'keepShortTerms' to enable this feature. > This inconsistency is confusing. I'll provide a patch with a possible fix. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-9242) Gradle Javadoc task should output the same documents as Ant
[ https://issues.apache.org/jira/browse/LUCENE-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049160#comment-17049160 ] Tomoko Uchida edited comment on LUCENE-9242 at 3/2/20 1:47 PM: --- My description of the inter-module links was not good... let me share an example to clarify the problem I am bumping into (though I'm not sure such a lengthy explanation is needed here). There is a link from {{o.a.l.a.core.KeywordAnalyzer}} to {{o.a.l.a.Analyzer}}. {{KeywordAnalyzer}} lives in the "analysis/common" project and {{Analyzer}} lives in the "core" project, so the link is an inter-module (inter-project) link from "analysis/common" to "core". [https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html] The link is represented by a relative path: {{}}, which is automatically generated by the Ant javadoc task ({{nested element}}). The link element (without the "offline" option) raises errors when the "href" path or URL cannot be resolved, so you need to make sure the path actually exists when invoking the javadoc task. This means you have to prepare, at the point of executing javadoc, the very same folder structure you'd like to publish. We can still create absolute-URL links with the "offline" option when the relative path is not available (Solr has many absolute-URL links to Lucene javadocs), but I feel that's a bit much for inter-project links... was (Author: tomoko uchida): My description about the inter-module links was not good... let me share an example to clarify a bit the problem I am bumping (though not sure if such redundant explanation is needed here). There is a link from {{o.a.l.a.core.KeywordAnalyzer}} to {{o.a.l.a.Analyzer}}. The {{KeywordAnalyzer}} is placed in "analysis/common" project and {{Analyzer}} is placed in "core" project so the link is an inter-module (inter-project) link from "analysis/common" to "core". 
[https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html] The link is represented by a relative path: {{}}, this is automatically generated by Ant javadoc task ({{nested element}}). The link element (without "offline" option) raises errors when the "href" path or URL cannot be resolved, so you need to be sure to the path is substantial when invoking javadoc task. This means you have to prepare the very same folder structure as you'd like to publish at the point of executing javadoc. Still we can create absolute URL links with "offline" option if the relative path is not available (Solr has many absolute URL links to Lucene javadocs), but I feel it's a bit much for inter-project links... > Gradle Javadoc task should output the same documents as Ant > --- > > Key: LUCENE-9242 > URL: https://issues.apache.org/jira/browse/LUCENE-9242 > Project: Lucene - Core > Issue Type: Sub-task > Components: general/javadocs >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > "javadoc" task for the Gradle build does not correctly output package > summaries, since it ignores "package.html" file in the source tree (so the > Python linter {{checkJavaDocs.py}} detects that and fails for now.) > Also the "javadoc" task should make inter-module links just as Ant build does. > See for more details: LUCENE-9201 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9242) Gradle Javadoc task should output the same documents as Ant
[ https://issues.apache.org/jira/browse/LUCENE-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049160#comment-17049160 ] Tomoko Uchida commented on LUCENE-9242: --- My description of the inter-module links was not good... let me share an example to clarify the problem I am bumping into (though I'm not sure such a lengthy explanation is needed here). There is a link from {{o.a.l.a.core.KeywordAnalyzer}} to {{o.a.l.a.Analyzer}}. {{KeywordAnalyzer}} lives in the "analysis/common" project and {{Analyzer}} lives in the "core" project, so the link is an inter-module (inter-project) link from "analysis/common" to "core". [https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html] The link is represented by a relative path: {{}}, which is automatically generated by the Ant javadoc task ({{nested element}}). The link element (without the "offline" option) raises errors when the "href" path or URL cannot be resolved, so you need to make sure the path actually exists when invoking the javadoc task. This means you have to prepare, at the point of executing javadoc, the very same folder structure you'd like to publish. We can still create absolute-URL links with the "offline" option when the relative path is not available (Solr has many absolute-URL links to Lucene javadocs), but I feel that's a bit much for inter-project links... > Gradle Javadoc task should output the same documents as Ant > --- > > Key: LUCENE-9242 > URL: https://issues.apache.org/jira/browse/LUCENE-9242 > Project: Lucene - Core > Issue Type: Sub-task > Components: general/javadocs >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > "javadoc" task for the Gradle build does not correctly output package > summaries, since it ignores "package.html" file in the source tree (so the > Python linter {{checkJavaDocs.py}} detects that and fails for now.) 
> Also the "javadoc" task should make inter-module links just as Ant build does. > See for more details: LUCENE-9201 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9259) NGramFilter use wrong argument name for preserve option
[ https://issues.apache.org/jira/browse/LUCENE-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17050263#comment-17050263 ] Tomoko Uchida commented on LUCENE-9259: --- [~Paul Pazderski] thanks, good catch. The fix looks good to me (it preserves backward compatibility); let me check the tests and documentation. > NGramFilter use wrong argument name for preserve option > --- > > Key: LUCENE-9259 > URL: https://issues.apache.org/jira/browse/LUCENE-9259 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 7.4, 8.0 >Reporter: Paul Pazderski >Priority: Minor > Attachments: LUCENE-9259.patch > > > LUCENE-7960 added the possibility to preserve the original term when using > NGram filters. The documentation says to enable it with 'preserveOriginal' > and it works for EdgeNGram filter. But NGram filter requires the initially > planned option 'keepShortTerms' to enable this feature. > This inconsistency is confusing. I'll provide a patch with a possible fix. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-11746) numeric fields need better error handling for prefix/wildcard syntax -- consider uniform support for "foo:* == foo:[* TO *]"
[ https://issues.apache.org/jira/browse/SOLR-11746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049991#comment-17049991 ] Tomoko Uchida commented on SOLR-11746: -- I did a clean checkout from the Apache repo, and the build still doesn't work for me. {code} git clone g...@github.com:apache/lucene-solr.git cd lucene-solr/ ant -f solr/solr-ref-guide/build.xml build-site ... build-site: [java] Relative link points at id that doesn't exist in dest: the-standard-query-parser.html#differences-between-lucenes-classic-query-parser-and-solrs-standard-query-parser [java] ... source: file:/mnt/hdd/tmp/lucene-solr/solr/build/solr-ref-guide/html-site/common-query-parameters.html [java] Relative link points at id that doesn't exist in dest: #differences-between-lucenes-classic-query-parser-and-solrs-standard-query-parser [java] ... source: file:/mnt/hdd/tmp/lucene-solr/solr/build/solr-ref-guide/html-site/the-standard-query-parser.html [java] Processed 2610 links (1930 relative) to 3728 anchors in 262 files [java] Total of 2 problems found BUILD FAILED {code} I have not looked at the details yet, but the problem may occur only for me (caused by my setup?). > numeric fields need better error handling for prefix/wildcard syntax -- > consider uniform support for "foo:* == foo:[* TO *]" > > > Key: SOLR-11746 > URL: https://issues.apache.org/jira/browse/SOLR-11746 > Project: Solr > Issue Type: Bug >Affects Versions: 7.0 >Reporter: Chris M. 
Hostetter >Assignee: Houston Putman >Priority: Major > Fix For: master (9.0), 8.5 > > Attachments: SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, > SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, > SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch > > > On the solr-user mailing list, Torsten Krah pointed out that with Trie > numeric fields, query syntax such as {{foo_d:\*}} has been functionality > equivilent to {{foo_d:\[\* TO \*]}} and asked why this was not also supported > for Point based numeric fields. > The fact that this type of syntax works (for {{indexed="true"}} Trie fields) > appears to have been an (untested, undocumented) fluke of Trie fields given > that they use indexed terms for the (encoded) numeric terms and inherit the > default implementation of {{FieldType.getPrefixQuery}} which produces a > prefix query against the {{""}} (empty string) term. > (Note that this syntax has aparently _*never*_ worked for Trie fields with > {{indexed="false" docValues="true"}} ) > In general, we should assess the behavior users attempt a prefix/wildcard > syntax query against numeric fields, as currently the behavior is largely > non-sensical: prefix/wildcard syntax frequently match no docs w/o any sort > of error, and the aformentioned {{numeric_field:*}} behaves inconsistently > between points/trie fields and between indexed/docValued trie fields. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023757#comment-17023757 ] Tomoko Uchida commented on LUCENE-9123: --- I opened an issue for the SynonymGraphFilter: LUCENE-9173. Also I found an issue about multi-word synonyms LUCENE-8137, it seems like it's a different issue discussed here (but I'm not fully sure of that). > JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter > --- > > Key: LUCENE-9123 > URL: https://issues.apache.org/jira/browse/LUCENE-9123 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 8.4 >Reporter: Kazuaki Hiraga >Assignee: Tomoko Uchida >Priority: Major > Attachments: LUCENE-9123.patch, LUCENE-9123_8x.patch > > > JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with > both of SynonymGraphFilter and SynonymFilter when JT generates multiple > tokens as an output. If we use `mode=normal`, it should be fine. However, we > would like to use decomposed tokens that can maximize to chance to increase > recall. > Snippet of schema: > {code:xml} > positionIncrementGap="100" autoGeneratePhraseQueries="false"> > > > synonyms="lang/synonyms_ja.txt" > tokenizerFactory="solr.JapaneseTokenizerFactory"/> > > > tags="lang/stoptags_ja.txt" /> > > > > > > minimumLength="4"/> > > > > > {code} > An synonym entry that generates error: > {noformat} > 株式会社,コーポレーション > {noformat} > The following is an output on console: > {noformat} > $ ./bin/solr create_core -c jp_test -d ../config/solrconfs > ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] > Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 > (got: 0) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-9173) SynonymGraphFilter doesn't correctly consume decompounded tokens (branched token graph)
Tomoko Uchida created LUCENE-9173: - Summary: SynonymGraphFilter doesn't correctly consume decompounded tokens (branched token graph) Key: LUCENE-9173 URL: https://issues.apache.org/jira/browse/LUCENE-9173 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Reporter: Tomoko Uchida This is a derived issue from LUCENE-9123. When the tokenizer that is given to SynonymGraphFilter decompounds tokens or emits multiple tokens at the same position, SynonymGraphFilter cannot correctly handle them (an exception will be thrown). For example, JapaneseTokenizer (mode=SEARCH) would emit a token and two decompounded tokens for the text "株式会社": {code:java} 株式会社 (positionIncrement=0, positionLength=2) 株式 (positionIncrement=1, positionLength=1) 会社 (positionIncrement=1, positionLength=1) {code} Then if we provide the synonym "株式会社,コーポレーション" to SynonymGraphFilter (with tokenizerFactory=JapaneseTokenizerFactory), this exception is thrown. {code:java} Caused by: java.lang.IllegalArgumentException: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 (got: 0) at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:325) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38] at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38] at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38] at org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.loadSynonyms(SynonymGraphFilterFactory.java:179) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38] at 
org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.inform(SynonymGraphFilterFactory.java:154) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38] {code} This isn't limited to JapaneseTokenizer; it is a more general issue about handling branched token graphs (decompounded tokens in the midstream). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
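To make the failure mode concrete, here is a minimal sketch in plain Java (not Lucene's actual code; the class, Token fields, and findBranch helper are hypothetical) of the constraint that SynonymMap.Parser effectively enforces when it analyzes a synonym's text: every emitted token must have positionIncrement == 1, i.e. the tokens must form a single linear chain, so the overlapping compound token above (positionIncrement=0) trips the check.

```java
import java.util.Arrays;
import java.util.List;

public class SynonymTokenCheck {
    // A token as a (term, positionIncrement, positionLength) triple.
    static final class Token {
        final String term; final int posInc; final int posLen;
        Token(String term, int posInc, int posLen) {
            this.term = term; this.posInc = posInc; this.posLen = posLen;
        }
    }

    // Returns null if the stream is a plain linear chain, otherwise the term of
    // the offending token: posInc=0 means a branch (token graph), posInc>1 a hole.
    static String findBranch(List<Token> tokens) {
        for (Token t : tokens)
            if (t.posInc != 1) return t.term;
        return null;
    }

    public static void main(String[] args) {
        // JapaneseTokenizer mode=SEARCH output for 株式会社, per the description above
        List<Token> searchMode = Arrays.asList(
            new Token("株式会社", 0, 2),   // compound token overlapping the two below
            new Token("株式", 1, 1),
            new Token("会社", 1, 1));
        System.out.println(findBranch(searchMode)); // the compound token trips the check
    }
}
```

With mode=NORMAL the tokenizer emits only a linear chain, so findBranch would return null; that matches the observation in LUCENE-9123 that mode=normal works fine.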
[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023778#comment-17023778 ] Tomoko Uchida commented on LUCENE-9123: --- When reproducing this issue I noticed that JapaneseTokenizer (mode=search) gives positionIncrement=1 for the decompounded token "株式" instead of 0. This looks strange to me - is this the expected behaviour? If not, it may affect the synonym handling. > JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter > --- > > Key: LUCENE-9123 > URL: https://issues.apache.org/jira/browse/LUCENE-9123 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 8.4 >Reporter: Kazuaki Hiraga >Assignee: Tomoko Uchida >Priority: Major > Attachments: LUCENE-9123.patch, LUCENE-9123_8x.patch > > > JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with > both of SynonymGraphFilter and SynonymFilter when JT generates multiple > tokens as an output. If we use `mode=normal`, it should be fine. However, we > would like to use decomposed tokens that can maximize the chance to increase > recall. > Snippet of schema: > {code:xml} > positionIncrementGap="100" autoGeneratePhraseQueries="false"> > > > synonyms="lang/synonyms_ja.txt" > tokenizerFactory="solr.JapaneseTokenizerFactory"/> > > > tags="lang/stoptags_ja.txt" /> > > > > > > minimumLength="4"/> > > > > > {code} > An synonym entry that generates error: > {noformat} > 株式会社,コーポレーション > {noformat} > The following is an output on console: > {noformat} > $ ./bin/solr create_core -c jp_test -d ../config/solrconfs > ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] > Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 > (got: 0) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023863#comment-17023863 ] Tomoko Uchida commented on LUCENE-9123: --- Thanks [~h.kazuaki] for updating the patches. +1, I will commit them with CHANGES and MIGRATE entries next weekend or so (sorry for the delay, I may not have time to test them locally right now). Meanwhile, can you tell us the e-mail address that should be logged as the author of the patch?
[jira] [Updated] (LUCENE-9173) SynonymGraphFilter doesn't correctly consume decompounded tokens (branched token graph)
[ https://issues.apache.org/jira/browse/LUCENE-9173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-9173: -- Description: This is a derived issue from LUCENE-9123. When the tokenizer that is given to SynonymGraphFilter decompounds tokens or emits multiple tokens at the same position, SynonymGraphFilter cannot correctly handle them (an exception will be thrown). For example, JapaneseTokenizer (mode=SEARCH) would emit a token and two decompounded tokens for the text "株式会社": {code:java} 株式会社 (positionIncrement=0, positionLength=2) 株式 (positionIncrement=1, positionLength=1) 会社 (positionIncrement=1, positionLength=1) {code} Then if we give a synonym "株式会社,コーポレーション" via SynonymGraphFilterFactory (with tokenizerFactory=JapaneseTokenizerFactory), this exception is thrown: {code:java} Caused by: java.lang.IllegalArgumentException: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 (got: 0) at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:325) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38] at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38] at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38] at org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.loadSynonyms(SynonymGraphFilterFactory.java:179) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38] at org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.inform(SynonymGraphFilterFactory.java:154) ~[lucene-analyzers-common-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38] {code} This isn't limited to JapaneseTokenizer; it is a more general issue about handling a branched token graph (decompounded tokens in midstream). > SynonymGraphFilter doesn't correctly consume decompounded tokens (branched > token graph) > > > Key: LUCENE-9173 > URL: https://issues.apache.org/jira/browse/LUCENE-9173 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Tomoko Uchida >Priority: Minor
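The failure mode can be sketched without Lucene: when a synonym rule is analyzed, SynonymMap's parser rejects any term whose analyzed tokens do not all advance the position by exactly 1, which is precisely what a decompounding tokenizer produces. The following is a minimal stand-in for that check; the class and method names are hypothetical, not Lucene's real API.

```java
public class SynonymRuleCheck {
    /** Mimics the constraint enforced when a synonym rule is analyzed:
     *  every emitted token must have positionIncrement == 1. Returns the
     *  first offending term (the one that would appear in the
     *  "position increment != 1" exception), or null if the rule
     *  would be accepted. */
    static String firstOffendingToken(String[] terms, int[] increments) {
        for (int i = 0; i < terms.length; i++) {
            if (increments[i] != 1) {
                return terms[i];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // mode=normal: the compound stays one token with increment 1 -> accepted.
        System.out.println(firstOffendingToken(
            new String[] {"株式会社"}, new int[] {1}));
        // mode=search: the stacked compound token carries increment 0 -> rejected,
        // matching the error "term: 株式会社 ... position increment != 1 (got: 0)".
        System.out.println(firstOffendingToken(
            new String[] {"株式", "株式会社", "会社"}, new int[] {1, 0, 1}));
    }
}
```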
[jira] [Comment Edited] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023778#comment-17023778 ] Tomoko Uchida edited comment on LUCENE-9123 at 1/26/20 6:47 PM: When reproducing this issue I noticed that JapaneseTokenizer (mode=search) gives positionIncrement=1 for the decompounded token "株式" instead of 0. This looks strange to me; is this the expected behaviour? If not, it may affect the synonym handling. And please ignore my previous comment... I was mistaken about the position increment. was (Author: tomoko uchida): When reproducing this issue I noticed that JapaneseTokenizer (mode=search) gives positionIncrement=1 for the decompounded token "株式" instead of 0. This looks strange to me; is this the expected behaviour? If not, it may affect the synonym handling.
[jira] [Comment Edited] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023778#comment-17023778 ] Tomoko Uchida edited comment on LUCENE-9123 at 1/26/20 6:49 PM: When reproducing this issue I noticed that JapaneseTokenizer (mode=search) gives positionIncrement=1 for the decompounded token "株式" instead of 0. This looks strange to me; is this the expected behaviour? If not, it may affect the synonym handling. Please ignore my comment above... I was mistaken about the position increment. was (Author: tomoko uchida): When reproducing this issue I noticed that JapaneseTokenizer (mode=search) gives positionIncrement=1 for the decompounded token "株式" instead of 0. This looks strange to me; is this the expected behaviour? If not, it may affect the synonym handling. And please ignore my previous comment... I was mistaken about the position increment.
[jira] [Commented] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031275#comment-17031275 ] Tomoko Uchida commented on LUCENE-9004: --- The context is created on a per-reader basis, not per query. You didn't share your test code, but I suspect you open a new IndexReader every time you issue a query? I think if you reuse one index reader (index searcher) throughout the test, the memory usage stays stable between 2 and 4 GB. Anyway, yes, the static cache (for the graph structure) isn't a good implementation; that is one reason why I said the HNSW branch is still at a pretty early stage... > Approximate nearest vector search > - > > Key: LUCENE-9004 > URL: https://issues.apache.org/jira/browse/LUCENE-9004 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Michael Sokolov >Priority: Major > Attachments: hnsw_layered_graph.png > > Time Spent: 3h 10m > Remaining Estimate: 0h > > "Semantic" search based on machine-learned vector "embeddings" representing > terms, queries and documents is becoming a must-have feature for a modern > search engine. SOLR-12890 is exploring various approaches to this, including > providing vector-based scoring functions. This is a spinoff issue from that. > The idea here is to explore approximate nearest-neighbor search. Researchers > have found an approach based on navigating a graph that partially encodes the > nearest neighbor relation at multiple scales can provide accuracy > 95% (as > compared to exact nearest neighbor calculations) at a reasonable cost. This > issue will explore implementing HNSW (hierarchical navigable small-world) > graphs for the purpose of approximate nearest vector search (often referred > to as KNN or k-nearest-neighbor search). > At a high level the way this algorithm works is this. First assume you have a > graph that has a partial encoding of the nearest neighbor relation, with some > short and some long-distance links. 
If this graph is built in the right way > (has the hierarchical navigable small world property), then you can > efficiently traverse it to find nearest neighbors (approximately) in log N > time where N is the number of nodes in the graph. I believe this idea was > pioneered in [1]. The great insight in that paper is that if you use the > graph search algorithm to find the K nearest neighbors of a new document > while indexing, and then link those neighbors (undirectedly, ie both ways) to > the new document, then the graph that emerges will have the desired > properties. > The implementation I propose for Lucene is as follows. We need two new data > structures to encode the vectors and the graph. We can encode vectors using a > light wrapper around {{BinaryDocValues}} (we also want to encode the vector > dimension and have efficient conversion from bytes to floats). For the graph > we can use {{SortedNumericDocValues}} where the values we encode are the > docids of the related documents. Encoding the interdocument relations using > docids directly will make it relatively fast to traverse the graph since we > won't need to lookup through an id-field indirection. This choice limits us > to building a graph-per-segment since it would be impractical to maintain a > global graph for the whole index in the face of segment merges. However > graph-per-segment is very natural at search time - we can traverse each > segments' graph independently and merge results as we do today for term-based > search. > At index time, however, merging graphs is somewhat challenging. While > indexing we build a graph incrementally, performing searches to construct > links among neighbors. When merging segments we must construct a new graph > containing elements of all the merged segments. Ideally we would somehow > preserve the work done when building the initial graphs, but at least as a > start I'd propose we construct a new graph from scratch when merging. 
The > process is going to be limited, at least initially, to graphs that can fit > in RAM since we require random access to the entire graph while constructing > it: In order to add links bidirectionally we must continually update existing > documents. > I think we want to express this API to users as a single joint > {{KnnGraphField}} abstraction that joins together the vectors and the graph > as a single joint field type. Mostly it just looks like a vector-valued > field, but has this graph attached to it. > I'll push a branch with my POC and would love to hear comments. It has many > nocommits, basic design is not really set, there is no Query implementation > and no integration with IndexSearcher, but it does work by some measure using > a standalone test class. I've tested with uniform random vectors and on my > laptop
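The core traversal idea described above — start at an entry point and hop to whichever neighbor is closer to the query until no neighbor improves — can be sketched in a few lines. This is only the greedy search primitive, not the full HNSW algorithm with its layered graph and beam search, and all names here are illustrative, not the proposed Lucene API.

```java
public class GreedyGraphSearch {

    /** Squared Euclidean distance; monotone with the true distance,
     *  so it is fine for nearest-neighbor comparisons. */
    static float distance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Walk the proximity graph greedily: from the entry node, move to
     *  any neighbor closer to the query; stop at a local minimum. On a
     *  graph with the navigable small-world property this approximates
     *  the nearest neighbor in roughly log N hops. */
    static int greedySearch(float[][] vectors, int[][] neighbors, int entry, float[] query) {
        int current = entry;
        float best = distance(vectors[current], query);
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int n : neighbors[current]) {
                float d = distance(vectors[n], query);
                if (d < best) {
                    best = d;
                    current = n;
                    improved = true;
                }
            }
        }
        return current;
    }

    public static void main(String[] args) {
        // Four 1-d points linked as a chain: 0 - 1 - 2 - 3.
        float[][] vectors = {{0f}, {1f}, {2f}, {3f}};
        int[][] neighbors = {{1}, {0, 2}, {1, 3}, {2}};
        // Starting from node 0, the walk hops 0 -> 1 -> 2 -> 3 for query 2.9.
        System.out.println(greedySearch(vectors, neighbors, 0, new float[] {2.9f})); // 3
    }
}
```

The long-distance links the description mentions are what keep this walk from taking O(N) hops on a plain chain like the toy graph above; without them the greedy walk still terminates, just more slowly.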
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031702#comment-17031702 ] Tomoko Uchida commented on LUCENE-9201: --- [~erickerickson] I added sub-tasks equivalent to the ant targets. - -check-broken-links (this internally calls {{dev-tools/scripts/checkJavadocLinks.py}}) - -check-missing-javadocs (this internally calls {{dev-tools/scripts/checkJavaDocs.py}}) And I opened a PR :) [https://github.com/apache/lucene-solr/pull/1242] I think this is almost equivalent to Ant's "documentation-lint", with some notes below. [~erickerickson] [~dweiss] Could you review it? *Note:* For now, the Python linters - {{checkBrokenLinks}}, {{checkMissingJavadocsClass}} and {{checkMissingJavadocsMethod}} - will fail because the Gradle-generated Javadocs seem to be slightly different from the Ant-generated ones. * Javadoc directory structure: "ant documentation" generates an "analyzers-common" docs dir for the "analysis/common" module, but "gradlew javadoc" generates "analysis/common" for the same module. I think we can adjust the structure, but where is the suitable place to do so? * Package summary: "ant documentation" uses "package.html" as the package summary description, but "gradlew javadoc" ignores "package.html" (so some packages lack a summary description in "package-summary.html" when building javadocs with Gradle). We might be able to make the Gradle Javadoc task properly handle "package.html" files with some options. Or, should we replace all "package.html" with "package-info.java" at this time? After the Gradle-generated Javadoc is fixed, we can return here and complete this sub-task. 
> Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts)
[jira] [Updated] (LUCENE-9077) Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-9077: -- Attachment: LUCENE-9077-javadoc-locale-en-US.patch > Gradle build > > > Key: LUCENE-9077 > URL: https://issues.apache.org/jira/browse/LUCENE-9077 > Project: Lucene - Core > Issue Type: Task >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Major > Fix For: master (9.0) > > Attachments: LUCENE-9077-javadoc-locale-en-US.patch > > Time Spent: 2.5h > Remaining Estimate: 0h > > This task focuses on providing gradle-based build equivalent for Lucene and > Solr (on master branch). See notes below on why this respin is needed. > The code lives on *gradle-master* branch. It is kept in sync with *master*. > Try running the following to see an overview of helper guides concerning > typical workflow, testing and ant-migration helpers: > gradlew :help > A list of items that need to be added or require work. If you'd like to > work on any of these, please add your name to the list. Once you have a > patch/ pull request let me (dweiss) know - I'll try to coordinate the merges. > * (/) Apply forbiddenAPIs > * (/) Generate hardware-aware gradle defaults for parallelism (count of > workers and test JVMs). > * (/) Fail the build if --tests filter is applied and no tests execute > during the entire build (this allows for an empty set of filtered tests at > single project level). > * (/) Port other settings and randomizations from common-build.xml > * (/) Configure security policy/ sandboxing for tests. > * (/) test's console output on -Ptests.verbose=true > * (/) add a :helpDeps explanation to how the dependency system works > (palantir plugin, lockfile) and how to retrieve structured information about > current dependencies of a given module (in a tree-like output). > * (/) jar checksums, jar checksum computation and validation. This should be > done without intermediate folders (directly on dependency sets). 
> * (/) verify min. JVM version and exact gradle version on build startup to > minimize odd build side-effects > * (/) Repro-line for failed tests/ runs. > * (/) add a top-level README note about building with gradle (and the > required JVM). > * (/) add an equivalent of 'validate-source-patterns' > (check-source-patterns.groovy) to precommit. > * (/) add an equivalent of 'rat-sources' to precommit. > * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) > to precommit. > * (/) javadoc compilation > Hard-to-implement stuff already investigated: > * (/) (done) -*Printing console output of failed tests.* There doesn't seem > to be any way to do this in a reasonably efficient way. There are onOutput > listeners but they're slow to operate and solr tests emit *tons* of output so > it's an overkill.- > * (!) (LUCENE-9120) *Tests working with security-debug logs or other > JVM-early log output*. Gradle's test runner works by redirecting Java's > stdout/ syserr so this just won't work. Perhaps we can spin the ant-based > test runner for such corner-cases. > Of lesser importance: > * Add an equivalent of 'documentation-lint" to precommit. > * (/) Do not require files to be committed before running precommit. (staged > files are fine). > * (/) add rendering of javadocs (gradlew javadoc) > * Attach javadocs to maven publications. > * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid > it'll be difficult to run it sensibly because gradle doesn't offer cwd > separation for the forked test runners. > * if you diff solr packaged distribution against ant-created distribution > there are minor differences in library versions and some JARs are excluded/ > moved around. I didn't try to force these as everything seems to work (tests, > etc.) – perhaps these differences should be fixed in the ant build instead. > * [EOE] identify and port various "regenerate" tasks from ant builds > (javacc, precompiled automata, etc.) 
> * Fill in POM details in gradle/defaults-maven.gradle so that they reflect > the previous content better (dependencies aside). > * Add any IDE integration layers that should be added (I use IntelliJ and it > imports the project out of the box, without the need for any special tuning). > * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; > currently XSLT...) > * I didn't bother adding Solr dist/test-framework to packaging (who'd use it > from a binary distribution? > * There is some python execution in check-broken-links and > check-missing-javadocs, not sure if it's been ported > * Nightly-smoke also have some python execution, not sure of the status. > * Precommit doesn't catch unused imports > > *{color:#ff}Note:{color}* this builds on the work
[jira] [Commented] (LUCENE-9077) Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031732#comment-17031732 ] Tomoko Uchida commented on LUCENE-9077: --- I found a JDK Javadoc tool related issue which was fixed in the ant build in https://issues.apache.org/jira/browse/LUCENE-8738?focusedCommentId=16822659=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16822659. I attached the same workaround patch [^LUCENE-9077-javadoc-locale-en-US.patch] for the gradle build. Will commit it soon.
[jira] [Commented] (LUCENE-9077) Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028347#comment-17028347 ] Tomoko Uchida commented on LUCENE-9077: --- Hi [~dweiss], {quote}Add an equivalent of 'documentation-lint" to precommit. {quote} what's the current status of that? I just started trying to port the "documentation-lint" task on my local branch: [https://github.com/mocobeta/lucene-solr-mirror/commit/7adc390183b10ea1b64fded000a87900853cf912] I'm not sure if it is still a significant task for the Gradle build. Would it be of any help here?
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032601#comment-17032601 ] Tomoko Uchida commented on LUCENE-9201: --- Thank you [~rcmuir] for your work and comments. I updated the PR (refactored the gradle tasks and ported the ant build details as much as I could). I hope it is a good starting point, if not perfect. Some of the ant scripts' hacks have still not been ported, especially the "ecj-macro" stuff, which I cannot figure out how to translate to gradle. > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png > > Time Spent: 10m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029841#comment-17029841 ] Tomoko Uchida commented on LUCENE-9004: --- {quote} Unfortunately, I couldn't obtain the corresponding results of HNSW due to the out of memory error in my PC. {quote} The current HNSW implementation requires a 4GB heap for a 1M dataset of 200-dimension vectors (we need to reduce the memory consumption). The default heap size given to Java processes depends on the platform, but for most commodity PCs it won't be that large, so you will see an OOM error unless you set the -Xmx JVM arg. > Approximate nearest vector search > - > > Key: LUCENE-9004 > URL: https://issues.apache.org/jira/browse/LUCENE-9004 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Michael Sokolov >Priority: Major > Attachments: hnsw_layered_graph.png > > Time Spent: 3h 10m > Remaining Estimate: 0h > > "Semantic" search based on machine-learned vector "embeddings" representing > terms, queries and documents is becoming a must-have feature for a modern > search engine. SOLR-12890 is exploring various approaches to this, including > providing vector-based scoring functions. This is a spinoff issue from that. > The idea here is to explore approximate nearest-neighbor search. Researchers > have found an approach based on navigating a graph that partially encodes the > nearest neighbor relation at multiple scales can provide accuracy > 95% (as > compared to exact nearest neighbor calculations) at a reasonable cost. This > issue will explore implementing HNSW (hierarchical navigable small-world) > graphs for the purpose of approximate nearest vector search (often referred > to as KNN or k-nearest-neighbor search). > At a high level the way this algorithm works is this. First assume you have a > graph that has a partial encoding of the nearest neighbor relation, with some > short and some long-distance links. 
If this graph is built in the right way > (has the hierarchical navigable small world property), then you can > efficiently traverse it to find nearest neighbors (approximately) in log N > time where N is the number of nodes in the graph. I believe this idea was > pioneered in [1]. The great insight in that paper is that if you use the > graph search algorithm to find the K nearest neighbors of a new document > while indexing, and then link those neighbors (undirectedly, i.e. both ways) to > the new document, then the graph that emerges will have the desired > properties. > The implementation I propose for Lucene is as follows. We need two new data > structures to encode the vectors and the graph. We can encode vectors using a > light wrapper around {{BinaryDocValues}} (we also want to encode the vector > dimension and have efficient conversion from bytes to floats). For the graph > we can use {{SortedNumericDocValues}} where the values we encode are the > docids of the related documents. Encoding the interdocument relations using > docids directly will make it relatively fast to traverse the graph since we > won't need to look up through an id-field indirection. This choice limits us > to building a graph-per-segment since it would be impractical to maintain a > global graph for the whole index in the face of segment merges. However, > graph-per-segment is very natural at search time - we can traverse each > segment's graph independently and merge results as we do today for term-based > search. > At index time, however, merging graphs is somewhat challenging. While > indexing we build a graph incrementally, performing searches to construct > links among neighbors. When merging segments we must construct a new graph > containing elements of all the merged segments. Ideally we would somehow > preserve the work done when building the initial graphs, but at least as a > start I'd propose we construct a new graph from scratch when merging. 
The > process is going to be limited, at least initially, to graphs that can fit > in RAM since we require random access to the entire graph while constructing > it: In order to add links bidirectionally we must continually update existing > documents. > I think we want to express this API to users as a single > {{KnnGraphField}} abstraction that joins together the vectors and the graph > in a single field type. Mostly it just looks like a vector-valued > field, but has this graph attached to it. > I'll push a branch with my POC and would love to hear comments. It has many > nocommits, the basic design is not really set, there is no Query implementation > and no integration with IndexSearcher, but it does work by some measure using > a standalone test class. I've tested with uniform random vectors and on my > laptop indexed
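The proposal above calls for encoding vectors behind {{BinaryDocValues}} with "efficient conversion from bytes to floats". As a rough illustration of that idea (this is a sketch, not the actual Lucene implementation; class and method names are made up for the example), a fixed-dimension vector can be round-tripped through a byte[] with java.nio.ByteBuffer. Note also the arithmetic behind the 4GB-heap comment above: the raw vectors alone for 1M documents of 200 dimensions are 1,000,000 x 200 x 4 bytes, roughly 800 MB, before any graph structure.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class VectorCodec {
    // Encode a float vector into the byte form a BinaryDocValues field could store.
    static byte[] encode(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES);
        buf.asFloatBuffer().put(vector);
        return buf.array();
    }

    // Decode, assuming the dimension is known up front (e.g. kept in field metadata).
    static float[] decode(byte[] bytes, int dimension) {
        float[] vector = new float[dimension];
        ByteBuffer.wrap(bytes).asFloatBuffer().get(vector);
        return vector;
    }

    public static void main(String[] args) {
        float[] v = {0.25f, -1.5f, 3.0f};
        float[] roundTripped = decode(encode(v), v.length);
        System.out.println(Arrays.equals(v, roundTripped)); // prints "true"
    }
}
```

The round trip is lossless because IEEE-754 float bits are copied verbatim, which is what makes a thin byte-level wrapper workable.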
[jira] [Created] (LUCENE-9201) Port documentation-lint task to Gradle build
Tomoko Uchida created LUCENE-9201: - Summary: Port documentation-lint task to Gradle build Key: LUCENE-9201 URL: https://issues.apache.org/jira/browse/LUCENE-9201 Project: Lucene - Core Issue Type: Sub-task Affects Versions: master (9.0) Reporter: Tomoko Uchida Assignee: Tomoko Uchida Ant build's "documentation-lint" target consists of those two sub targets. - "-ecj-javadoc-lint" (Javadoc linting by ECJ) - "-documentation-lint"(Missing javadocs / broken links check by python scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-9201: -- Description: Ant build's "documentation-lint" target consists of those two sub targets. * "-ecj-javadoc-lint" (Javadoc linting by ECJ) * "-documentation-lint"(Missing javadocs / broken links check by python scripts) was: Ant build's "documentation-lint" target consists of those two sub targets. - "-ecj-javadoc-lint" (Javadoc linting by ECJ) - "-documentation-lint"(Missing javadocs / broken links check by python scripts) > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9077) Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028665#comment-17028665 ] Tomoko Uchida commented on LUCENE-9077: --- [~erickerickson] thanks for your comment. I opened a sub-task: LUCENE-9201 > Gradle build > > > Key: LUCENE-9077 > URL: https://issues.apache.org/jira/browse/LUCENE-9077 > Project: Lucene - Core > Issue Type: Task >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Major > Fix For: master (9.0) > > Time Spent: 2.5h > Remaining Estimate: 0h > > This task focuses on providing a gradle-based build equivalent for Lucene and > Solr (on the master branch). See notes below on why this respin is needed. > The code lives on the *gradle-master* branch. It is kept in sync with *master*. > Try running the following to see an overview of helper guides concerning > typical workflow, testing and ant-migration helpers: > gradlew :help > A list of items that need to be added or require work follows. If you'd like to > work on any of these, please add your name to the list. Once you have a > patch/ pull request let me (dweiss) know - I'll try to coordinate the merges. > * (/) Apply forbiddenAPIs > * (/) Generate hardware-aware gradle defaults for parallelism (count of > workers and test JVMs). > * (/) Fail the build if --tests filter is applied and no tests execute > during the entire build (this allows for an empty set of filtered tests at > single project level). > * (/) Port other settings and randomizations from common-build.xml > * (/) Configure security policy/ sandboxing for tests. > * (/) test's console output on -Ptests.verbose=true > * (/) add a :helpDeps explanation to how the dependency system works > (palantir plugin, lockfile) and how to retrieve structured information about > current dependencies of a given module (in a tree-like output). > * (/) jar checksums, jar checksum computation and validation. This should be > done without intermediate folders (directly on dependency sets). 
> * (/) verify min. JVM version and exact gradle version on build startup to > minimize odd build side-effects > * (/) Repro-line for failed tests/ runs. > * (/) add a top-level README note about building with gradle (and the > required JVM). > * (/) add an equivalent of 'validate-source-patterns' > (check-source-patterns.groovy) to precommit. > * (/) add an equivalent of 'rat-sources' to precommit. > * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) > to precommit. > * (/) javadoc compilation > Hard-to-implement stuff already investigated: > * (/) (done) -*Printing console output of failed tests.* There doesn't seem > to be any way to do this in a reasonably efficient way. There are onOutput > listeners but they're slow to operate and solr tests emit *tons* of output so > it's an overkill.- > * (!) (LUCENE-9120) *Tests working with security-debug logs or other > JVM-early log output*. Gradle's test runner works by redirecting Java's > stdout/ syserr so this just won't work. Perhaps we can spin the ant-based > test runner for such corner-cases. > Of lesser importance: > * Add an equivalent of 'documentation-lint" to precommit. > * (/) Do not require files to be committed before running precommit. (staged > files are fine). > * (/) add rendering of javadocs (gradlew javadoc) > * Attach javadocs to maven publications. > * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid > it'll be difficult to run it sensibly because gradle doesn't offer cwd > separation for the forked test runners. > * if you diff solr packaged distribution against ant-created distribution > there are minor differences in library versions and some JARs are excluded/ > moved around. I didn't try to force these as everything seems to work (tests, > etc.) – perhaps these differences should be fixed in the ant build instead. > * [EOE] identify and port various "regenerate" tasks from ant builds > (javacc, precompiled automata, etc.) 
> * Fill in POM details in gradle/defaults-maven.gradle so that they reflect > the previous content better (dependencies aside). > * Add any IDE integration layers that should be added (I use IntelliJ and it > imports the project out of the box, without the need for any special tuning). > * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; > currently XSLT...) > * I didn't bother adding Solr dist/test-framework to packaging (who'd use it > from a binary distribution?) > > *{color:#ff}Note:{color}* this builds on the work done by Mark Miller and > Cao Mạnh Đạt but also applies lessons learned from those two efforts: > * *Do not try to do too many things at once*. If we deviate too far from > master, the branch will be hard to merge. > * *Do everything
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028662#comment-17028662 ] Tomoko Uchida commented on LUCENE-9201: --- This is a WIP branch: [https://github.com/mocobeta/lucene-solr-mirror/commit/7adc390183b10ea1b64fded000a87900853cf912] To get started I'm trying to port the ECJ task. [The compiler|https://help.eclipse.org/2019-03/index.jsp?topic=%2Forg.eclipse.jdt.doc.user%2Ftasks%2Ftask-using_batch_compiler.htm] seems to work by retrieving dependencies from each sub-project's "configurations", but WARNINGS are suppressed (the Ant build outputs a lot of "ecj-lint" warnings). I tweaked the Gradle logger level for STDOUT, but that didn't work for me. > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-9123: -- Fix Version/s: 8.5 master (9.0) > JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter > --- > > Key: LUCENE-9123 > URL: https://issues.apache.org/jira/browse/LUCENE-9123 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Kazuaki Hiraga >Assignee: Tomoko Uchida >Priority: Major > Fix For: master (9.0), 8.5 > > Attachments: LUCENE-9123.patch, LUCENE-9123_8x.patch > > > JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with > either SynonymGraphFilter or SynonymFilter when JT generates multiple > tokens as output. If we use `mode=normal`, it should be fine. However, we > would like to use decomposed tokens to maximize the chance of increasing > recall. > Snippet of schema: > {code:xml} > positionIncrementGap="100" autoGeneratePhraseQueries="false"> > > > synonyms="lang/synonyms_ja.txt" > tokenizerFactory="solr.JapaneseTokenizerFactory"/> > > > tags="lang/stoptags_ja.txt" /> > > > > > > minimumLength="4"/> > > > > > {code} > A synonym entry that generates an error: > {noformat} > 株式会社,コーポレーション > {noformat} > The following is the output on the console: > {noformat} > $ ./bin/solr create_core -c jp_test -d ../config/solrconfs > ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] > Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 > (got: 0) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida resolved LUCENE-9123. --- Resolution: Fixed Merged the patches into master (with a MIGRATE notice) and branch_8x. Thanks [~h.kazuaki]! > JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter > --- > > Key: LUCENE-9123 > URL: https://issues.apache.org/jira/browse/LUCENE-9123 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Kazuaki Hiraga >Assignee: Tomoko Uchida >Priority: Major > Attachments: LUCENE-9123.patch, LUCENE-9123_8x.patch > > > JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with > either SynonymGraphFilter or SynonymFilter when JT generates multiple > tokens as output. If we use `mode=normal`, it should be fine. However, we > would like to use decomposed tokens to maximize the chance of increasing > recall. > Snippet of schema: > {code:xml} > positionIncrementGap="100" autoGeneratePhraseQueries="false"> > > > synonyms="lang/synonyms_ja.txt" > tokenizerFactory="solr.JapaneseTokenizerFactory"/> > > > tags="lang/stoptags_ja.txt" /> > > > > > > minimumLength="4"/> > > > > > {code} > A synonym entry that generates an error: > {noformat} > 株式会社,コーポレーション > {noformat} > The following is the output on the console: > {noformat} > $ ./bin/solr create_core -c jp_test -d ../config/solrconfs > ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] > Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 > (got: 0) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-9123: -- Affects Version/s: (was: 8.4) > JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter > --- > > Key: LUCENE-9123 > URL: https://issues.apache.org/jira/browse/LUCENE-9123 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Kazuaki Hiraga >Assignee: Tomoko Uchida >Priority: Major > Attachments: LUCENE-9123.patch, LUCENE-9123_8x.patch > > > JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with > either SynonymGraphFilter or SynonymFilter when JT generates multiple > tokens as output. If we use `mode=normal`, it should be fine. However, we > would like to use decomposed tokens to maximize the chance of increasing > recall. > Snippet of schema: > {code:xml} > positionIncrementGap="100" autoGeneratePhraseQueries="false"> > > > synonyms="lang/synonyms_ja.txt" > tokenizerFactory="solr.JapaneseTokenizerFactory"/> > > > tags="lang/stoptags_ja.txt" /> > > > > > > minimumLength="4"/> > > > > > {code} > A synonym entry that generates an error: > {noformat} > 株式会社,コーポレーション > {noformat} > The following is the output on the console: > {noformat} > $ ./bin/solr create_core -c jp_test -d ../config/solrconfs > ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] > Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 > (got: 0) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
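The schema snippet quoted in the LUCENE-9123 reports above lost its XML tags during extraction; only attribute values survive. A rough reconstruction of the kind of fieldType it appears to describe, assuming the standard Solr Japanese analysis factories (the exact factory list and attributes in the original report may have differed, so treat this only as an illustrative sketch):

```xml
<fieldType name="text_ja" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- mode="search" is the decomposing mode the report says breaks synonyms -->
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="lang/synonyms_ja.txt"
            tokenizerFactory="solr.JapaneseTokenizerFactory"/>
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
    <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The reported failure comes from the SynonymGraphFilterFactory line: when its tokenizerFactory decomposes a synonym entry such as 株式会社 into multiple tokens with position increment 0, synonym map construction rejects it.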
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041159#comment-17041159 ] Tomoko Uchida commented on LUCENE-9201: --- Hi, there remain two documentation-lint tasks to be ported. I opened a pull request for the easy one - the "check missing javadocs" task that can be defined on each sub-project. [https://github.com/apache/lucene-solr/pull/1267] This is functionally the same as my previous pull request, but I rewrote it in a bit more declarative manner. It depends on the default Gradle Javadoc task for now; I think the basic logic can still be applied when we switch to our custom javadoc task. Could you review it or give some comments on this? > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: LUCENE-9201-ecj-2.patch, LUCENE-9201-ecj.patch, > javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png > > Time Spent: 2h 50m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041449#comment-17041449 ] Tomoko Uchida commented on LUCENE-9201: --- Hi Dawid, thanks for your comments! I think I get the points, I will update the merge request and notify you. > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: LUCENE-9201-ecj-2.patch, LUCENE-9201-ecj.patch, > javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9155) Port Kuromoji dictionary compilation (regenerate)
[ https://issues.apache.org/jira/browse/LUCENE-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041549#comment-17041549 ] Tomoko Uchida commented on LUCENE-9155: --- Actually 'naist' here is not a dictionary format/type; it's all about the data (word entries and the language model). So technically it should be buildable by the same logic as the 'ipadic' (default) dictionary. I agree that the current ant script for switching dictionary sources isn't great. Also, the 'naist' dictionary is no longer widely used for practical purposes as far as I know, though it has historical significance in academia. I think we can skip the 'naist' dictionary itself for now and focus on providing a better way to use/build alternative dictionaries (along with LUCENE-8816; I would like to restart that when we complete the migration to the gradle build). > Port Kuromoji dictionary compilation (regenerate) > - > > Key: LUCENE-9155 > URL: https://issues.apache.org/jira/browse/LUCENE-9155 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Minor > Attachments: kuromoji.patch > > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9155) Port Kuromoji dictionary compilation (regenerate)
[ https://issues.apache.org/jira/browse/LUCENE-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041571#comment-17041571 ] Tomoko Uchida commented on LUCENE-9155: --- If we drop/skip the build for the 'naist' dictionary for the next major release, I think we have to mention it in CHANGES, though I am not sure anyone cares about it... > Port Kuromoji dictionary compilation (regenerate) > - > > Key: LUCENE-9155 > URL: https://issues.apache.org/jira/browse/LUCENE-9155 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Minor > Attachments: kuromoji.patch > > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034462#comment-17034462 ] Tomoko Uchida edited comment on LUCENE-9201 at 2/11/20 2:00 PM: Just out of curiosity, I roughly assessed the penalty of "copying all javadocs" on my PC (Core i7-8700 cpu, fedora). // when the javadocs are already collected into one directory, and we just run the python linter {code:java} $ ./gradlew checkBrokenLinks BUILD FAILED in 29s {code} // first collect javadocs into a directory on HDD (7200rpm), then run the python linter {code:java} $ ./gradlew checkBrokenLinks BUILD FAILED in 37s {code} When I did the same thing on an NVMe SSD, there was almost no penalty, just in case. (note: "BUILD FAILED" is an expected result for now.) was (Author: tomoko uchida): Just out of curiosity, I roughly assessed the penalty of "copying all javadocs" on my PC (Core i7-8700 cpu, fedora). // when the javadocs are already collected into one directory, and we just run the python linter {code:java} $ ./gradlew checkBrokenLinks BUILD FAILED in 29s {code} // first collect javadocs into a directory on HDD (7200rpm), then run the python linter {code:java} $ ./gradlew checkBrokenLinks BUILD FAILED in 37s {code} When I did the same thing on an NVMe SSD, there was almost no penalty, just in case. > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. 
> * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034462#comment-17034462 ] Tomoko Uchida commented on LUCENE-9201: --- Just out of curiosity, I roughly assessed the penalty of "copying all javadocs" on my PC (Core i7-8700 cpu, fedora). // when the javadocs are already collected into one directory, and we just run the python linter {code:java} $ ./gradlew checkBrokenLinks BUILD FAILED in 29s {code} // first collect javadocs into a directory on HDD (7200rpm), then run the python linter {code:java} $ ./gradlew checkBrokenLinks BUILD FAILED in 37s {code} When I did the same thing on an NVMe SSD, there was almost no penalty, just in case. > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
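The "collect all javadocs into one directory" step being timed in the comment above is essentially a recursive directory copy, which is why the penalty shows up on a spinning disk (37s vs 29s) and disappears on an SSD. A minimal sketch of such a copy using only java.nio (the paths are hypothetical; this is not the actual build code):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public class CollectJavadocs {
    // Recursively copy one project's javadoc output under a shared root,
    // e.g. lucene/core/build/docs/javadoc -> build/docs/lucene/core.
    static void copyTree(Path source, Path target) throws IOException {
        try (Stream<Path> paths = Files.walk(source)) {
            paths.forEach(p -> {
                try {
                    Path dest = target.resolve(source.relativize(p).toString());
                    if (Files.isDirectory(p)) {
                        Files.createDirectories(dest);
                    } else {
                        Files.createDirectories(dest.getParent());
                        Files.copy(p, dest, StandardCopyOption.REPLACE_EXISTING);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```

Running one such copy per sub-project before the python linter reproduces the extra I/O measured above; a symlink-based or no-copy approach would avoid it entirely.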
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034375#comment-17034375 ] Tomoko Uchida commented on LUCENE-9201: --- {quote}It is my personal preference to have a project-scope granularity. This way you can run a project-scoped task (like gradlew -p lucene/core javadoc). My personal take on assembling "distributions" is to have a separate project that just takes what it needs from other projects and puts it together (with any tweaks required). This makes it easier to reason about how a distribution is assembled and from where, while each project just takes care of itself. {quote} I get it and have no objection. Let's keep the javadocs output as is and gather the files when needed. We might want an independent task to gather all javadocs, though for the present I did it in the "checking broken links" task. {quote}Let me look at the patch again later today (digging myself out of the vacation hole). {quote} Thank you, I updated the PR according to the comments from Uwe Schindler. I also added "// FIXME" comments to some code that doesn't work well or needs to be fixed. > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png > > Time Spent: 50m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038257#comment-17038257 ] Tomoko Uchida commented on LUCENE-9201: --- Thanks, I got the current status. I will try to create a patch for [LUCENE-9219]. > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: LUCENE-9201-ecj.patch, javadocGRADLE.png, > javadocHTML4.png, javadocHTML5.png > > Time Spent: 2h 40m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037788#comment-17037788 ] Tomoko Uchida commented on LUCENE-9201: --- [~dweiss] Thank you for your guidance and detailed comments in the patch! I applied this [^LUCENE-9201-ecj.patch] to my branch and ran the {{ecjLint}} task, then found the task seemed to finish in one second without executing the ECJ's Main class. (I intentionally added an unused import to o.a.l.a.Analyzer but that wasn't detected.) {code:java} $ ./gradlew :lucene:core:ecjLint BUILD SUCCESSFUL in 1s 4 actionable tasks: 4 up-to-date {code} I have not found any problems in the task definition (and not yet studied deeply how it works), but am I missing something? With my previous patch [https://github.com/apache/lucene-solr/pull/1242/files], for example, the {{:lucene:core:ecjLint}} takes about five seconds and (correctly) fails with this failure message if there are unused imports. {code:java} $ ./gradlew :lucene:core:ecjLint > Task :lucene:core:ecjLint -- 1. ERROR in /mnt/hdd/repo/lucene-solr-mirror/lucene/core/src/java/org/apache/lucene/analysis/Analyzer.java (at line 27) import java.util.ArrayList; ^^^ The import java.util.ArrayList is never used -- 1 problem (1 error) > Task :lucene:core:ecjLint FAILED FAILURE: Build failed with an exception. * Where: Script '/mnt/hdd/repo/lucene-solr-mirror/gradle/validation/documentation-lint.gradle' line: 130 * What went wrong: Execution failed for task ':lucene:core:ecjLint'. > Process 'command '/usr/local/java/adoptopenjdk/jdk-11.0.3+7/bin/java'' > finished with non-zero exit value 255 * Try: Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights. * Get more help at https://help.gradle.org BUILD FAILED in 4s {code} I will look a bit more closely at why the ecjLint task doesn't work for me... 
> Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: LUCENE-9201-ecj.patch, javadocGRADLE.png, > javadocHTML4.png, javadocHTML5.png > > Time Spent: 2h 40m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042900#comment-17042900 ] Tomoko Uchida commented on LUCENE-9201: --- [~dweiss] I did some tests with the patch [^LUCENE-9201.patch] and noticed this seems to cause an error with the :solr:server project, where the {{java-library}} plugin is applied but there is actually no Java source - hence no Javadocs folder. {code:java} lucene-solr $ ./gradlew :solr:server:checkMissingDocs > Task :solr:server:checkMissingDocsDefault FAILED FAILURE: Build failed with an exception. * Where: Script '/mnt/hdd/repo/lucene-solr/gradle/validation/missing-docs-check.gradle' line: 105 * What went wrong: Execution failed for task ':solr:server:checkMissingDocsDefault'. > Javadoc verification failed: Traceback (most recent call last): File "/mnt/hdd/repo/lucene-solr/dev-tools/scripts/checkJavaDocs.py", line 388, in if checkPackageSummaries(sys.argv[1], level): File "/mnt/hdd/repo/lucene-solr/dev-tools/scripts/checkJavaDocs.py", line 351, in checkPackageSummaries checkClassSummaries(root) File "/mnt/hdd/repo/lucene-solr/dev-tools/scripts/checkJavaDocs.py", line 154, in checkClassSummaries f = open(fullPath, encoding='UTF-8') FileNotFoundError: [Errno 2] No such file or directory: '/mnt/hdd/repo/lucene-solr/solr/server/build/docs/javadoc' {code} How can we properly exclude such irregular projects? This workaround works for me, does this make sense...? {code:java} @TaskAction def lint() { -dirs.each { dir -> +//dirs.each { dir -> +dirs.findAll { project.file(it).exists() }.each { dir -> project.logger.info("Checking for missing docs... 
(dir=${dir}, level=${level})") checkMissingJavadocs(dir, level) } {code} > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: LUCENE-9201-ecj-2.patch, LUCENE-9201-ecj.patch, > LUCENE-9201.patch, javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png > > Time Spent: 4h > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042900#comment-17042900 ] Tomoko Uchida edited comment on LUCENE-9201 at 2/23/20 11:11 AM: - [~dweiss] I did some tests with the patch [^LUCENE-9201.patch] and noticed this seems to cause an error with the :solr:server project, where the {{java-library}} plugin is applied but there is actually no Java source - hence no Javadocs folder. {code:java} lucene-solr $ ./gradlew :solr:server:checkMissingDocs > Task :solr:server:checkMissingDocsDefault FAILED FAILURE: Build failed with an exception. * Where: Script '/mnt/hdd/repo/lucene-solr/gradle/validation/missing-docs-check.gradle' line: 105 * What went wrong: Execution failed for task ':solr:server:checkMissingDocsDefault'. > Javadoc verification failed: Traceback (most recent call last): File "/mnt/hdd/repo/lucene-solr/dev-tools/scripts/checkJavaDocs.py", line 388, in if checkPackageSummaries(sys.argv[1], level): File "/mnt/hdd/repo/lucene-solr/dev-tools/scripts/checkJavaDocs.py", line 351, in checkPackageSummaries checkClassSummaries(root) File "/mnt/hdd/repo/lucene-solr/dev-tools/scripts/checkJavaDocs.py", line 154, in checkClassSummaries f = open(fullPath, encoding='UTF-8') FileNotFoundError: [Errno 2] No such file or directory: '/mnt/hdd/repo/lucene-solr/solr/server/build/docs/javadoc' {code} How can we properly exclude such irregular projects? This workaround works for me, does this make sense...? {code:java} @TaskAction def lint() { -dirs.each { dir -> +//dirs.each { dir -> +dirs.findAll { it.exists() }.each { dir -> project.logger.info("Checking for missing docs... 
(dir=${dir}, level=${level})") checkMissingJavadocs(dir, level) } {code} was (Author: tomoko uchida): [~dweiss] I did some tests with the patch [^LUCENE-9201.patch] and noticed this seems to cause an error with :solr:server project, where the {{java-libarry}} plugin is applied but there is actually no Java source - hence no Javadocs folder. {code:java} lucene-solr $ ./gradlew :solr:server:checkMissingDocs > Task :solr:server:checkMissingDocsDefault FAILED FAILURE: Build failed with an exception. * Where: Script '/mnt/hdd/repo/lucene-solr/gradle/validation/missing-docs-check.gradle' line: 105 * What went wrong: Execution failed for task ':solr:server:checkMissingDocsDefault'. > Javadoc verification failed: Traceback (most recent call last): File "/mnt/hdd/repo/lucene-solr/dev-tools/scripts/checkJavaDocs.py", line 388, in if checkPackageSummaries(sys.argv[1], level): File "/mnt/hdd/repo/lucene-solr/dev-tools/scripts/checkJavaDocs.py", line 351, in checkPackageSummaries checkClassSummaries(root) File "/mnt/hdd/repo/lucene-solr/dev-tools/scripts/checkJavaDocs.py", line 154, in checkClassSummaries f = open(fullPath, encoding='UTF-8') FileNotFoundError: [Errno 2] No such file or directory: '/mnt/hdd/repo/lucene-solr/solr/server/build/docs/javadoc' {code} How can we properly exclude such irregular projects? This workaround works for me, does this make sence...? {code:java} @TaskAction def lint() { -dirs.each { dir -> +//dirs.each { dir -> +dirs.findAll { project.file(it).exists() }.each { dir -> project.logger.info("Checking for missing docs... 
(dir=${dir}, level=${level})") checkMissingJavadocs(dir, level) } {code} > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: LUCENE-9201-ecj-2.patch, LUCENE-9201-ecj.patch, > LUCENE-9201.patch, javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png > > Time Spent: 4h > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-9242) Gradle Javadoc task does not include package summaries
Tomoko Uchida created LUCENE-9242: - Summary: Gradle Javadoc task does not include package summaries Key: LUCENE-9242 URL: https://issues.apache.org/jira/browse/LUCENE-9242 Project: Lucene - Core Issue Type: Sub-task Components: general/javadocs Affects Versions: master (9.0) Reporter: Tomoko Uchida The "javadoc" task for the Gradle build does not correctly output package summaries, since it ignores the "package.html" files in the source tree (so the Python linter {{checkJavaDocs.py}} detects this and fails for now). See LUCENE-9201 for more details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
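For context, the linter referenced here verifies that generated javadoc pages carry package descriptions. A toy sketch of that kind of check (a hypothetical function, not the real checkJavaDocs.py, which parses the actual javadoc HTML structure in detail):

```python
import re

def has_package_description(html: str) -> bool:
    """Rough check: does a javadoc package page carry any prose in a
    description block? (Toy stand-in for the kind of verification
    checkJavaDocs.py performs; the real script is far more thorough.)"""
    m = re.search(r'<div class="block">(.*?)</div>', html, re.DOTALL)
    return bool(m and m.group(1).strip())
```

A package whose package.html was ignored by the Gradle javadoc task would produce a summary page with no such description, which is what makes the linter fail.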
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043140#comment-17043140 ] Tomoko Uchida commented on LUCENE-9201: --- I just merged [^LUCENE-9201-missing-docs.patch] into the master. It would be a bit hard to maintain the python script (it relies on Javadoc HTML details) ... but the linter will do its work, until it is replaced with LUCENE-9215. Also, I opened LUCENE-9242 to fix Javadocs for gradle build - should we directly call the javadoc tool, instead of using Gradle's default one? > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: LUCENE-9201-ecj-2.patch, LUCENE-9201-ecj.patch, > LUCENE-9201-missing-docs.patch, LUCENE-9201.patch, javadocGRADLE.png, > javadocHTML4.png, javadocHTML5.png > > Time Spent: 4h 10m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037404#comment-17037404 ] Tomoko Uchida commented on LUCENE-9004: --- {quote}It can readily be shown that HNSW performs much better in query time. But I was surprised that top 1 in-set recall percent of HNSW is so low. It shouldn't be a problem of algorithm itself, but more likely a problem of implementation or test code. I will check it this weekend. {quote} Thanks [~irvingzhang] for measuring. I noticed I might have made a very basic mistake when comparing neighborhood nodes, maybe some inequality signs should be flipped :/ I will do recall performance tests with GloVE and fix the bugs. > Approximate nearest vector search > - > > Key: LUCENE-9004 > URL: https://issues.apache.org/jira/browse/LUCENE-9004 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Michael Sokolov >Priority: Major > Attachments: hnsw_layered_graph.png > > Time Spent: 3h 20m > Remaining Estimate: 0h > > "Semantic" search based on machine-learned vector "embeddings" representing > terms, queries and documents is becoming a must-have feature for a modern > search engine. SOLR-12890 is exploring various approaches to this, including > providing vector-based scoring functions. This is a spinoff issue from that. > The idea here is to explore approximate nearest-neighbor search. Researchers > have found an approach based on navigating a graph that partially encodes the > nearest neighbor relation at multiple scales can provide accuracy > 95% (as > compared to exact nearest neighbor calculations) at a reasonable cost. This > issue will explore implementing HNSW (hierarchical navigable small-world) > graphs for the purpose of approximate nearest vector search (often referred > to as KNN or k-nearest-neighbor search). > At a high level the way this algorithm works is this. 
First assume you have a > graph that has a partial encoding of the nearest neighbor relation, with some > short and some long-distance links. If this graph is built in the right way > (has the hierarchical navigable small world property), then you can > efficiently traverse it to find nearest neighbors (approximately) in log N > time where N is the number of nodes in the graph. I believe this idea was > pioneered in [1]. The great insight in that paper is that if you use the > graph search algorithm to find the K nearest neighbors of a new document > while indexing, and then link those neighbors (undirectedly, ie both ways) to > the new document, then the graph that emerges will have the desired > properties. > The implementation I propose for Lucene is as follows. We need two new data > structures to encode the vectors and the graph. We can encode vectors using a > light wrapper around {{BinaryDocValues}} (we also want to encode the vector > dimension and have efficient conversion from bytes to floats). For the graph > we can use {{SortedNumericDocValues}} where the values we encode are the > docids of the related documents. Encoding the interdocument relations using > docids directly will make it relatively fast to traverse the graph since we > won't need to look up through an id-field indirection. This choice limits us > to building a graph-per-segment since it would be impractical to maintain a > global graph for the whole index in the face of segment merges. However > graph-per-segment is very natural at search time - we can traverse each > segment's graph independently and merge results as we do today for term-based > search. > At index time, however, merging graphs is somewhat challenging. While > indexing we build a graph incrementally, performing searches to construct > links among neighbors. When merging segments we must construct a new graph > containing elements of all the merged segments. 
Ideally we would somehow > preserve the work done when building the initial graphs, but at least as a > start I'd propose we construct a new graph from scratch when merging. The > process is going to be limited, at least initially, to graphs that can fit > in RAM since we require random access to the entire graph while constructing > it: In order to add links bidirectionally we must continually update existing > documents. > I think we want to express this API to users as a single joint > {{KnnGraphField}} abstraction that joins together the vectors and the graph > as a single joint field type. Mostly it just looks like a vector-valued > field, but has this graph attached to it. > I'll push a branch with my POC and would love to hear comments. It has many > nocommits, basic design is not really set, there is no Query implementation > and no integration with IndexSearcher, but it does work by some measure using > a
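The traversal described in this issue (greedy best-first search over a neighbor graph, keeping the best candidates found so far) can be sketched roughly as follows. This is a toy illustration under simplifying assumptions (plain dicts instead of doc values, a single flat layer rather than the HNSW hierarchy), not the proposed Lucene implementation:

```python
import heapq

def greedy_search(neighbors, vectors, query, entry, dist, ef=10):
    """Best-first search over a nearest-neighbor graph: starting from an
    entry node, repeatedly expand the closest unvisited candidate and keep
    the ef best results seen so far (the core loop of HNSW's layer search)."""
    visited = {entry}
    d0 = dist(vectors[entry], query)
    candidates = [(d0, entry)]   # min-heap: closest candidate first
    results = [(-d0, entry)]     # max-heap (negated): worst kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break  # closest remaining candidate is worse than the worst kept result
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            nd = dist(vectors[nb], query)
            if len(results) < ef or nd < -results[0][0]:
                heapq.heappush(candidates, (nd, nb))
                heapq.heappush(results, (-nd, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the worst kept result
    return sorted((-nd, nb) for nd, nb in results)  # (distance, node), ascending
```

Indexing in the HNSW scheme reuses exactly this search: find the K nearest neighbors of a new document, then link them to it in both directions.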
[jira] [Comment Edited] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039137#comment-17039137 ] Tomoko Uchida edited comment on LUCENE-9201 at 2/18/20 2:59 PM: Yes, the patch works for me too, with this modification. It's not explicitly documented, but {{sourceSet.getTaskName("ecjLint", null)}} generates "ecjLintMain" for main sourceSet and "ecjLintTest" for test sourceSet (on Gradle version 6.0). {code:java} //tasks.create("${sourceSet.name}EcjLint", JavaExec, { tasks.create(sourceSet.getTaskName("ecjLint", null), JavaExec, { {code} Can I ask a few more questions before creating a patch? 1. Even when I commented out those lines in gradle.build, ":solr:solr-ref-guide:ecjLint" finished without doing anything (linting was safely skipped by the changes in solr/solr-ref-guide/build.gradle, if my understanding is correct). Is this configuration still needed? {code:java} // This excludes solr-ref-guide from the check (excludes are not taken into account // and linting of the ant-based task fails. configure(project(":solr:solr-ref-guide")) { afterEvaluate { project.tasks.findByPath("mainEcjLint").enabled = false } } {code} 2. Currently all other check tasks are grouped in "Verification", so would it be better to change the task group name "validation" to "Verification"? {code:java} $ ./gradlew tasks ... Validation tasks ecjLint - Lint Java sources using ECJ. Verification tasks -- check - Runs all checks. checkUnusedConstraints - Ensures all versions in your versions.props correspond to an actual gradle dependency forbiddenApis - Runs forbidden-apis checks. owasp - Check project dependencies against OWASP vulnerability database. rat - Runs Apache Rat checks. test - Runs the unit tests. verifyLocks - Verifies that your versions.lock is up to date {code} was (Author: tomoko uchida): Yes, the patch works for me too, with this modification. 
It's not explicitly documented, but {{sourceSet.getTaskName("ecjLint", null)}} generates "ecjLintMain" for main sourceSet and "ecjLintTest" for test sourceSet (on Gradle version 6.0). {code:java} //tasks.create("${sourceSet.name}EcjLint", JavaExec, { tasks.create(sourceSet.getTaskName("ecjLint", null), JavaExec, { {code} Can I ask a few questions before creating a patch? 1. When I commented out those lines in gradle.build, ":solr:solr-ref-guide:ecjLint" finished without doing anything (linting was safely skipped by the changes in solr/solr-ref-guide/build.gradle, if my understanding is correct). Is this configuration still needed? {code:java} // This excludes solr-ref-guide from the check (excludes are not taken into account // and linting of the ant-based task fails. configure(project(":solr:solr-ref-guide")) { afterEvaluate { project.tasks.findByPath("mainEcjLint").enabled = false } } {code} 2. Currently all other check tasks are grouped in "Verification", so would it be better to change the task group name "validation" to "Verification"? {code:java} $ ./gradlew tasks ... Validation tasks ecjLint - Lint Java sources using ECJ. Verification tasks -- check - Runs all checks. checkUnusedConstraints - Ensures all versions in your versions.props correspond to an actual gradle dependency forbiddenApis - Runs forbidden-apis checks. owasp - Check project dependencies against OWASP vulnerability database. rat - Runs Apache Rat checks. test - Runs the unit tests. 
verifyLocks - Verifies that your versions.lock is up to date {code} > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: LUCENE-9201-ecj-2.patch, LUCENE-9201-ecj.patch, > javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png > > Time Spent: 2h 40m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9219) Port ECJ-based linter to gradle
[ https://issues.apache.org/jira/browse/LUCENE-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-9219: -- Attachment: LUCENE-9219.patch > Port ECJ-based linter to gradle > --- > > Key: LUCENE-9219 > URL: https://issues.apache.org/jira/browse/LUCENE-9219 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Dawid Weiss >Priority: Major > Attachments: LUCENE-9219.patch, LUCENE-9219.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9219) Port ECJ-based linter to gradle
[ https://issues.apache.org/jira/browse/LUCENE-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039190#comment-17039190 ] Tomoko Uchida commented on LUCENE-9219: --- Hi [~dweiss], I attached [^LUCENE-9219.patch], a modified version of the patch in LUCENE-9201. Could you take a look at this? (diffs from LUCENE-9201-ecj-2.patch) - use sourceSet.getTaskName() - task group was changed to "Verification" - the artifact version of "org.eclipse.jdt:ecj" was moved to "build.gradle" from "ecj-lint.gradle" Also I excluded the changes for "lucene/common-build.xml", because I thought this was added to the patch accidentally. Please let me know if this should be included again. {code:java} diff --git a/lucene/common-build.xml b/lucene/common-build.xml index 1e3da88250b..ca5887df550 100644 --- a/lucene/common-build.xml +++ b/lucene/common-build.xml @@ -2034,6 +2034,7 @@ ${ant.project.name}.test.dependencies=${test.classpath.list} + {code} > Port ECJ-based linter to gradle > --- > > Key: LUCENE-9219 > URL: https://issues.apache.org/jira/browse/LUCENE-9219 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Dawid Weiss >Priority: Major > Attachments: LUCENE-9219.patch, LUCENE-9219.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039137#comment-17039137 ] Tomoko Uchida commented on LUCENE-9201: --- Yes, the patch works for me too, with this modification. It's not explicitly documented, but {{sourceSet.getTaskName("ecjLint", null)}} generates "ecjLintMain" for main sourceSet and "ecjLintTest" for test sourceSet (on Gradle version 6.0). {code:java} //tasks.create("${sourceSet.name}EcjLint", JavaExec, { tasks.create(sourceSet.getTaskName("ecjLint", null), JavaExec, { {code} Can I ask a few questions before creating a patch? 1. When I commented out those lines in gradle.build, ":solr:solr-ref-guide:ecjLint" finished without doing anything (linting was safely skipped by the changes in solr/solr-ref-guide/build.gradle, if my understanding is correct). Is this configuration still needed? {code:java} // This excludes solr-ref-guide from the check (excludes are not taken into account // and linting of the ant-based task fails. configure(project(":solr:solr-ref-guide")) { afterEvaluate { project.tasks.findByPath("mainEcjLint").enabled = false } } {code} 2. Currently all other check tasks are grouped in "Verification", so would it be better to change the task group name "validation" to "Verification"? {code:java} $ ./gradlew tasks ... Validation tasks ecjLint - Lint Java sources using ECJ. Verification tasks -- check - Runs all checks. checkUnusedConstraints - Ensures all versions in your versions.props correspond to an actual gradle dependency forbiddenApis - Runs forbidden-apis checks. owasp - Check project dependencies against OWASP vulnerability database. rat - Runs Apache Rat checks. test - Runs the unit tests. 
verifyLocks - Verifies that your versions.lock is up to date {code} > Port documentation-lint task to Gradle build > > > Key: LUCENE-9201 > URL: https://issues.apache.org/jira/browse/LUCENE-9201 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: master (9.0) >Reporter: Tomoko Uchida >Assignee: Tomoko Uchida >Priority: Major > Attachments: LUCENE-9201-ecj-2.patch, LUCENE-9201-ecj.patch, > javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png > > Time Spent: 2h 40m > Remaining Estimate: 0h > > Ant build's "documentation-lint" target consists of those two sub targets. > * "-ecj-javadoc-lint" (Javadoc linting by ECJ) > * "-documentation-lint"(Missing javadocs / broken links check by python > scripts) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9219) Port ECJ-based linter to gradle
[ https://issues.apache.org/jira/browse/LUCENE-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-9219: -- Attachment: LUCENE-9219.patch > Port ECJ-based linter to gradle > --- > > Key: LUCENE-9219 > URL: https://issues.apache.org/jira/browse/LUCENE-9219 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Dawid Weiss >Priority: Major > Attachments: LUCENE-9219.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9219) Port ECJ-based linter to gradle
[ https://issues.apache.org/jira/browse/LUCENE-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida resolved LUCENE-9219. --- Fix Version/s: master (9.0) Assignee: Tomoko Uchida Resolution: Fixed It was merged. [~dweiss] Many thanks for your kind help. > Port ECJ-based linter to gradle > --- > > Key: LUCENE-9219 > URL: https://issues.apache.org/jira/browse/LUCENE-9219 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Dawid Weiss >Assignee: Tomoko Uchida >Priority: Major > Fix For: master (9.0) > > Attachments: LUCENE-9219.patch, LUCENE-9219.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-9004: -- Description: "Semantic" search based on machine-learned vector "embeddings" representing terms, queries and documents is becoming a must-have feature for a modern search engine. SOLR-12890 is exploring various approaches to this, including providing vector-based scoring functions. This is a spinoff issue from that. The idea here is to explore approximate nearest-neighbor search. Researchers have found an approach based on navigating a graph that partially encodes the nearest neighbor relation at multiple scales can provide accuracy > 95% (as compared to exact nearest neighbor calculations) at a reasonable cost. This issue will explore implementing HNSW (hierarchical navigable small-world) graphs for the purpose of approximate nearest vector search (often referred to as KNN or k-nearest-neighbor search). At a high level the way this algorithm works is this. First assume you have a graph that has a partial encoding of the nearest neighbor relation, with some short and some long-distance links. If this graph is built in the right way (has the hierarchical navigable small world property), then you can efficiently traverse it to find nearest neighbors (approximately) in log N time where N is the number of nodes in the graph. I believe this idea was pioneered in [1]. The great insight in that paper is that if you use the graph search algorithm to find the K nearest neighbors of a new document while indexing, and then link those neighbors (undirectedly, ie both ways) to the new document, then the graph that emerges will have the desired properties. The implementation I propose for Lucene is as follows. We need two new data structures to encode the vectors and the graph. 
We can encode vectors using a light wrapper around {{BinaryDocValues}} (we also want to encode the vector dimension and have efficient conversion from bytes to floats). For the graph we can use {{SortedNumericDocValues}} where the values we encode are the docids of the related documents. Encoding the interdocument relations using docids directly will make it relatively fast to traverse the graph since we won't need to look up through an id-field indirection. This choice limits us to building a graph-per-segment since it would be impractical to maintain a global graph for the whole index in the face of segment merges. However graph-per-segment is very natural at search time - we can traverse each segment's graph independently and merge results as we do today for term-based search. At index time, however, merging graphs is somewhat challenging. While indexing we build a graph incrementally, performing searches to construct links among neighbors. When merging segments we must construct a new graph containing elements of all the merged segments. Ideally we would somehow preserve the work done when building the initial graphs, but at least as a start I'd propose we construct a new graph from scratch when merging. The process is going to be limited, at least initially, to graphs that can fit in RAM since we require random access to the entire graph while constructing it: In order to add links bidirectionally we must continually update existing documents. I think we want to express this API to users as a single joint {{KnnGraphField}} abstraction that joins together the vectors and the graph as a single joint field type. Mostly it just looks like a vector-valued field, but has this graph attached to it. I'll push a branch with my POC and would love to hear comments. It has many nocommits, basic design is not really set, there is no Query implementation and no integration with IndexSearcher, but it does work by some measure using a standalone test class. 
I've tested with uniform random vectors and on my laptop indexed 10K documents in around 10 seconds and searched them at 95% recall (compared with exact nearest-neighbor baseline) at around 250 QPS. I haven't made any attempt to use multithreaded search for this, but it is amenable to per-segment concurrency. [1] [https://www.semanticscholar.org/paper/Efficient-and-robust-approximate-nearest-neighbor-Malkov-Yashunin/699a2e3b653c69aff5cf7a9923793b974f8ca164] *UPDATES:* * (1/12/2020) The up-to-date branch is: [https://gitbox.apache.org/repos/asf?p=lucene-solr.git;a=shortlog;h=refs/heads/jira/lucene-9004-aknn-2] was: "Semantic" search based on machine-learned vector "embeddings" representing terms, queries and documents is becoming a must-have feature for a modern search engine. SOLR-12890 is exploring various approaches to this, including providing vector-based scoring functions. This is a spinoff issue from that. The idea here is to explore approximate nearest-neighbor search. Researchers have found an approach based on navigating a
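The greedy traversal at the heart of the description above can be sketched in plain Java. This is a self-contained toy, not Lucene code: the class and method names (GreedyKnn, greedySearch) are ours, and the graph and vectors are passed in as plain arrays rather than read from doc values. Starting from an entry node, we repeatedly hop to whichever neighbor is closest to the query until no neighbor improves the distance.

```java
// Hypothetical sketch of greedy nearest-neighbor graph search; the real HNSW
// algorithm layers several such traversals over a hierarchy of graphs.
class GreedyKnn {

    // adjacency[i] lists the neighbor node ids of node i
    static int greedySearch(float[][] vectors, int[][] adjacency, float[] query, int entry) {
        int current = entry;
        float best = distance(vectors[current], query);
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int neighbor : adjacency[current]) {
                float d = distance(vectors[neighbor], query);
                if (d < best) {      // move to the closer neighbor and keep going
                    best = d;
                    current = neighbor;
                    improved = true;
                }
            }
        }
        return current; // approximate nearest neighbor of the query
    }

    static float distance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum; // squared Euclidean distance is enough for comparisons
    }
}
```

If the graph has the small-world property described above, each hop roughly halves the remaining distance, which is where the log N search cost comes from.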
[jira] [Updated] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-9004: -- Description: "Semantic" search based on machine-learned vector "embeddings" representing terms, queries and documents is becoming a must-have feature for a modern search engine. SOLR-12890 is exploring various approaches to this, including providing vector-based scoring functions. This is a spinoff issue from that. The idea here is to explore approximate nearest-neighbor search. Researchers have found an approach based on navigating a graph that partially encodes the nearest neighbor relation at multiple scales can provide accuracy > 95% (as compared to exact nearest neighbor calculations) at a reasonable cost. This issue will explore implementing HNSW (hierarchical navigable small-world) graphs for the purpose of approximate nearest vector search (often referred to as KNN or k-nearest-neighbor search). At a high level the way this algorithm works is this. First assume you have a graph that has a partial encoding of the nearest neighbor relation, with some short and some long-distance links. If this graph is built in the right way (has the hierarchical navigable small world property), then you can efficiently traverse it to find nearest neighbors (approximately) in log N time where N is the number of nodes in the graph. I believe this idea was pioneered in [1]. The great insight in that paper is that if you use the graph search algorithm to find the K nearest neighbors of a new document while indexing, and then link those neighbors (undirectedly, ie both ways) to the new document, then the graph that emerges will have the desired properties. The implementation I propose for Lucene is as follows. We need two new data structures to encode the vectors and the graph. 
We can encode vectors using a light wrapper around {{BinaryDocValues}} (we also want to encode the vector dimension and have efficient conversion from bytes to floats). For the graph we can use {{SortedNumericDocValues}} where the values we encode are the docids of the related documents. Encoding the interdocument relations using docids directly will make it relatively fast to traverse the graph since we won't need to look up through an id-field indirection. This choice limits us to building a graph-per-segment since it would be impractical to maintain a global graph for the whole index in the face of segment merges. However graph-per-segment is a very natural fit at search time - we can traverse each segment's graph independently and merge results as we do today for term-based search. At index time, however, merging graphs is somewhat challenging. While indexing we build a graph incrementally, performing searches to construct links among neighbors. When merging segments we must construct a new graph containing elements of all the merged segments. Ideally we would somehow preserve the work done when building the initial graphs, but at least as a start I'd propose we construct a new graph from scratch when merging. The process is going to be limited, at least initially, to graphs that can fit in RAM since we require random access to the entire graph while constructing it: in order to add links bidirectionally we must continually update existing documents. I think we want to express this API to users as a single joint {{KnnGraphField}} abstraction that joins together the vectors and the graph as a single joint field type. Mostly it just looks like a vector-valued field, but has this graph attached to it. I'll push a branch with my POC and would love to hear comments. It has many nocommits, the basic design is not really set, there is no Query implementation and no integration with IndexSearcher, but it does work by some measure using a standalone test class. 
I've tested with uniform random vectors and on my laptop indexed 10K documents in around 10 seconds and searched them at 95% recall (compared with exact nearest-neighbor baseline) at around 250 QPS. I haven't made any attempt to use multithreaded search for this, but it is amenable to per-segment concurrency. [1] [https://www.semanticscholar.org/paper/Efficient-and-robust-approximate-nearest-neighbor-Malkov-Yashunin/699a2e3b653c69aff5cf7a9923793b974f8ca164] *UPDATES:* * (1/12/2020) The up-to-date branch is: [https://github.com/apache/lucene-solr/tree/jira/lucene-9004-aknn-2] was: "Semantic" search based on machine-learned vector "embeddings" representing terms, queries and documents is becoming a must-have feature for a modern search engine. SOLR-12890 is exploring various approaches to this, including providing vector-based scoring functions. This is a spinoff issue from that. The idea here is to explore approximate nearest-neighbor search. Researchers have found an approach based on navigating a graph that partially encodes the
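The bytes-to-floats conversion mentioned for the {{BinaryDocValues}} wrapper could look roughly like the sketch below. This is a hedged illustration with made-up names (VectorCodec, encode, decode); Lucene's actual codec-level encoding may differ, e.g. in endianness or in how the dimension is stored.

```java
import java.nio.ByteBuffer;

// Hypothetical round-trip between a float vector and the byte payload a
// BinaryDocValues-style field would store: 4 bytes per dimension.
class VectorCodec {

    static byte[] encode(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES);
        for (float v : vector) {
            buf.putFloat(v);
        }
        return buf.array();
    }

    static float[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        float[] vector = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = buf.getFloat();
        }
        return vector;
    }
}
```

Since the per-document payload length is fixed by the dimension, the decoder can recover the vector without any extra framing bytes.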
[jira] [Commented] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013667#comment-17013667 ] Tomoko Uchida commented on LUCENE-9004: --- [~sokolov] thanks, I have also tested it myself with a real dataset generated from recent snapshot files of Japanese Wikipedia. Yes, it seems "functionally correct", although we should do more formal tests for measuring Recall (effectiveness). {quote}I think it's time to post back to a branch in the Apache git repository so we can enlist contributions from the community here to help this go forward. I'll try to get that done this weekend {quote} OK, I pushed the branch to the Apache Gitbox to let others who want to get involved in this issue check it out and give it a try. [https://gitbox.apache.org/repos/asf?p=lucene-solr.git;a=shortlog;h=refs/heads/jira/lucene-9004-aknn-2] This also includes a patch from Xin-Chun Zhang. Note: currently the new codec for the vectors and kNN graphs is placed in {{o.a.l.codecs.lucene90}}; I think we can move it to a proper location when this is ready to be released. > Approximate nearest vector search > - > > Key: LUCENE-9004 > URL: https://issues.apache.org/jira/browse/LUCENE-9004 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Michael Sokolov >Priority: Major > Attachments: hnsw_layered_graph.png > > > "Semantic" search based on machine-learned vector "embeddings" representing > terms, queries and documents is becoming a must-have feature for a modern > search engine. SOLR-12890 is exploring various approaches to this, including > providing vector-based scoring functions. This is a spinoff issue from that. > The idea here is to explore approximate nearest-neighbor search. Researchers > have found an approach based on navigating a graph that partially encodes the > nearest neighbor relation at multiple scales can provide accuracy > 95% (as > compared to exact nearest neighbor calculations) at a reasonable cost. 
This > issue will explore implementing HNSW (hierarchical navigable small-world) > graphs for the purpose of approximate nearest vector search (often referred > to as KNN or k-nearest-neighbor search). > At a high level the way this algorithm works is this. First assume you have a > graph that has a partial encoding of the nearest neighbor relation, with some > short and some long-distance links. If this graph is built in the right way > (has the hierarchical navigable small world property), then you can > efficiently traverse it to find nearest neighbors (approximately) in log N > time where N is the number of nodes in the graph. I believe this idea was > pioneered in [1]. The great insight in that paper is that if you use the > graph search algorithm to find the K nearest neighbors of a new document > while indexing, and then link those neighbors (undirectedly, ie both ways) to > the new document, then the graph that emerges will have the desired > properties. > The implementation I propose for Lucene is as follows. We need two new data > structures to encode the vectors and the graph. We can encode vectors using a > light wrapper around {{BinaryDocValues}} (we also want to encode the vector > dimension and have efficient conversion from bytes to floats). For the graph > we can use {{SortedNumericDocValues}} where the values we encode are the > docids of the related documents. Encoding the interdocument relations using > docids directly will make it relatively fast to traverse the graph since we > won't need to lookup through an id-field indirection. This choice limits us > to building a graph-per-segment since it would be impractical to maintain a > global graph for the whole index in the face of segment merges. However > graph-per-segment is a very natural at search time - we can traverse each > segments' graph independently and merge results as we do today for term-based > search. > At index time, however, merging graphs is somewhat challenging. 
While > indexing we build a graph incrementally, performing searches to construct > links among neighbors. When merging segments we must construct a new graph > containing elements of all the merged segments. Ideally we would somehow > preserve the work done when building the initial graphs, but at least as a > start I'd propose we construct a new graph from scratch when merging. The > process is going to be limited, at least initially, to graphs that can fit > in RAM since we require random access to the entire graph while constructing > it: In order to add links bidirectionally we must continually update existing > documents. > I think we want to express this API to users as a single joint > {{KnnGraphField}} abstraction that joins together the vectors and the graph > as a single joint field
[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013667#comment-17013667 ] Tomoko Uchida edited comment on LUCENE-9004 at 1/12/20 7:51 AM: [~sokolov] thanks, I have also tested it myself with a real dataset generated from recent snapshot files of Japanese Wikipedia. Yes, it seems "functionally correct", although we should do more formal tests for measuring Recall (effectiveness). {quote}I think it's time to post back to a branch in the Apache git repository so we can enlist contributions from the community here to help this go forward. I'll try to get that done this weekend {quote} OK, I pushed the branch to the Apache Gitbox to let others who want to get involved in this issue check it out and give it a try. While I feel it's far from complete :), I agree that the code is ready to take in contributions from the community. [https://gitbox.apache.org/repos/asf?p=lucene-solr.git;a=shortlog;h=refs/heads/jira/lucene-9004-aknn-2] This also includes a patch from Xin-Chun Zhang. Note: currently the new codec for the vectors and kNN graphs is placed in {{o.a.l.codecs.lucene90}}; I think we can move it to a proper location when this is ready to be released. was (Author: tomoko uchida): [~sokolov] thanks, I myself also have tested it with a real dataset that is generated from recent snapshot files of Japanese Wikipedia. Yes it seems like "functionally correct", although we should do more formal tests for measuring Recall (effectiveness). {quote}I think it's time to post back to a branch in the Apache git repository so we can enlist contributions from the community here to help this go forward. I'll try to get that done this weekend {quote} OK, I pushed the branch to the Apache Gitbox to let others who want to involve in this issue check out it and have a try. 
[https://gitbox.apache.org/repos/asf?p=lucene-solr.git;a=shortlog;h=refs/heads/jira/lucene-9004-aknn-2] This also includes a patch Xin-Chun Zhang. Note: currently the new codec for the vectors and kNN graphs is placed in {{o.a.l.codecs.lucene90}}, I think we can move this to proper location when this is ready to be released. > Approximate nearest vector search > - > > Key: LUCENE-9004 > URL: https://issues.apache.org/jira/browse/LUCENE-9004 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Michael Sokolov >Priority: Major > Attachments: hnsw_layered_graph.png > > > "Semantic" search based on machine-learned vector "embeddings" representing > terms, queries and documents is becoming a must-have feature for a modern > search engine. SOLR-12890 is exploring various approaches to this, including > providing vector-based scoring functions. This is a spinoff issue from that. > The idea here is to explore approximate nearest-neighbor search. Researchers > have found an approach based on navigating a graph that partially encodes the > nearest neighbor relation at multiple scales can provide accuracy > 95% (as > compared to exact nearest neighbor calculations) at a reasonable cost. This > issue will explore implementing HNSW (hierarchical navigable small-world) > graphs for the purpose of approximate nearest vector search (often referred > to as KNN or k-nearest-neighbor search). > At a high level the way this algorithm works is this. First assume you have a > graph that has a partial encoding of the nearest neighbor relation, with some > short and some long-distance links. If this graph is built in the right way > (has the hierarchical navigable small world property), then you can > efficiently traverse it to find nearest neighbors (approximately) in log N > time where N is the number of nodes in the graph. I believe this idea was > pioneered in [1]. 
The great insight in that paper is that if you use the > graph search algorithm to find the K nearest neighbors of a new document > while indexing, and then link those neighbors (undirectedly, ie both ways) to > the new document, then the graph that emerges will have the desired > properties. > The implementation I propose for Lucene is as follows. We need two new data > structures to encode the vectors and the graph. We can encode vectors using a > light wrapper around {{BinaryDocValues}} (we also want to encode the vector > dimension and have efficient conversion from bytes to floats). For the graph > we can use {{SortedNumericDocValues}} where the values we encode are the > docids of the related documents. Encoding the interdocument relations using > docids directly will make it relatively fast to traverse the graph since we > won't need to lookup through an id-field indirection. This choice limits us > to building a graph-per-segment since it would be impractical
[jira] [Commented] (LUCENE-9129) Updating from 7.X to 8.X breaks
[ https://issues.apache.org/jira/browse/LUCENE-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014043#comment-17014043 ] Tomoko Uchida commented on LUCENE-9129: --- It's not a bug but an intended change related to this optimization: https://issues.apache.org/jira/browse/LUCENE-4100 The issue is mentioned not in the "API Changes" section but in the "Optimizations" section of the change log: [https://lucene.apache.org/core/8_0_0/changes/Changes.html]. Although it includes API changes, I think the location is appropriate for the main purpose of the issue. In short, you need to implement {{Collector#scoreMode()}} and discard {{#needsScores()}} when upgrading to Lucene 8.0+, as the error message says. See: [https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/Collector.html] Since there were many changes/optimizations between 7.x and 8.x, it's hard to make an exhaustive list of every breaking change in the APIs (in addition, the {{Collector}} interface is marked as "Expert:", which means it is for expert users who are familiar with Lucene internals). So please refer to the Javadocs when you encounter errors related to a library version upgrade. The Git commit log and the diff command will also help in getting more detailed information. > Updating from 7.X to 8.X breaks > --- > > Key: LUCENE-9129 > URL: https://issues.apache.org/jira/browse/LUCENE-9129 > Project: Lucene - Core > Issue Type: Bug >Reporter: xia0c >Priority: Major > > Hi, during my upgrading process from 7.X to 8.X I found another code break. 
> {code:java}
> import java.io.IOException;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.LeafReaderContext;
> import org.apache.lucene.index.MultiDocValues;
> import org.apache.lucene.index.SortedDocValues;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.SimpleCollector;
> import org.apache.lucene.search.TopDocs;
> import org.apache.lucene.util.LongValues;
> import org.apache.solr.handler.component.FacetComponent.FacetContext;
> import org.apache.solr.search.DocSet;
> import org.apache.solr.search.DocSetUtil;
>
> public class TestLucene {
>
>     private FacetContext fcontext;
>     private DocSet docs;
>     private IndexReader reader;
>
>     public void demo() throws IOException {
>         DocSetUtil.collectSortedDocSet(docs, reader, new SimpleCollector() {
>             @Override
>             public boolean needsScores() { return false; }
>
>             @Override
>             protected void doSetNextReader(LeafReaderContext ctx) throws IOException {
>                 // TODO
>             }
>
>             @Override
>             public void collect(int doc) throws IOException {
>                 // TODO Auto-generated method stub
>             }
>         });
>     }
> }
> {code}
> The code should pass before, but it throws an error:
> {code:java}
> [ERROR] /TestLucene.java:[32,82] is not abstract and
> does not override abstract method scoreMode() in
> org.apache.lucene.search.Collector
> [ERROR] /TestLucene.java:[36,19] method does not override or implement a
> method from a supertype
> {code}
> I tried to find the change in the migration
> guide(https://github.com/apache/lucene-solr/blob/branch_8x/lucene/MIGRATE.txt)
> but I didn't find it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
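The migration the comment above asks for boils down to replacing {{needsScores()}} with {{scoreMode()}}. The sketch below uses locally defined stand-ins (the ScoreMode enum and CollectorStub class here are ours, defined only so the snippet is self-contained); real code extends {{org.apache.lucene.search.SimpleCollector}} and returns {{org.apache.lucene.search.ScoreMode.COMPLETE_NO_SCORES}}.

```java
// Hypothetical stand-ins for the Lucene 8.x Collector API, to illustrate
// the shape of the 7.x -> 8.x change only.
enum ScoreMode { COMPLETE, COMPLETE_NO_SCORES }

abstract class CollectorStub {
    // Lucene 8.0 replaced `boolean needsScores()` with this method
    public abstract ScoreMode scoreMode();
    public abstract void collect(int doc);
}

class MigratedCollector extends CollectorStub {
    @Override
    public ScoreMode scoreMode() {
        // was (7.x): public boolean needsScores() { return false; }
        return ScoreMode.COMPLETE_NO_SCORES;
    }

    @Override
    public void collect(int doc) {
        // collection logic is unchanged by the upgrade
    }
}
```

A collector that previously returned {{true}} from {{needsScores()}} would instead return {{ScoreMode.COMPLETE}}.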
[jira] [Comment Edited] (LUCENE-9129) Updating from 7.X to 8.X breaks
[ https://issues.apache.org/jira/browse/LUCENE-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014043#comment-17014043 ] Tomoko Uchida edited comment on LUCENE-9129 at 1/13/20 6:13 AM: It's not a bug but an intended change related to this optimization: https://issues.apache.org/jira/browse/LUCENE-4100 The issue is mentioned not in the "API Changes" section but in the "Optimizations" section of the change log: [https://lucene.apache.org/core/8_0_0/changes/Changes.html]. Although it includes API changes, I think the location is appropriate for the main purpose of the issue. In short, you need to implement {{Collector#scoreMode()}} and discard {{#needsScores()}} when upgrading to Lucene 8.0+, as the error message says. See: [https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/Collector.html] Since there were many changes/optimizations between 7.x and 8.x, it's hard to make an exhaustive list of every breaking change in the APIs (in addition, the {{Collector}} interface is marked as "Expert:", which means it is for expert users who are familiar with Lucene internals). So please refer to the Javadocs when you encounter errors related to a library version upgrade. The Git commit log and the diff command will also help in getting more detailed information. was (Author: tomoko uchida): It's not a bug but an intended change related to this optimization. https://issues.apache.org/jira/browse/LUCENE-4100 The issue isn't mentioned in "API Changes" but "Optimizations" section on the Change log: [https://lucene.apache.org/core/8_0_0/changes/Changes.html|https://lucene.apache.org/core/8_4_0/changes/Changes.html]. Although it includes API changes, I think the location is appropriate for the main purpose of the issue. In short, you need to implement {{Collector#scoreMode()}} and also discard {{#needsScores()}} when upgrading to Lucene 8.0+ as the log message says. 
See: [https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/Collector.html] Since there were many changes/optimizations between 7.x and 8.x, it's hard to make an exhaustive list for every breaking change in the APIs (in addition, the {{Collector}} interface is marked as "Expert:", that means this is for expert users who are familiar with Lucene internals). So could you refer the Javadocs when you encounter errors relating to the library version upgrade. The Git commit log and diff command will also be help for getting more detailed information. > Updating from 7.X to 8.X breaks > --- > > Key: LUCENE-9129 > URL: https://issues.apache.org/jira/browse/LUCENE-9129 > Project: Lucene - Core > Issue Type: Bug >Reporter: xia0c >Priority: Major > > Hi, during my upgrading process from 7.X to 8.X I found another code break. > {code:java} > import org.apache.lucene.analysis.Analyzer; > import org.apache.lucene.index.LeafReaderContext; > import org.apache.lucene.index.MultiDocValues; > import org.apache.lucene.index.SortedDocValues; > import org.apache.lucene.search.IndexSearcher; > import org.apache.lucene.search.Query; > import org.apache.lucene.search.SimpleCollector; > import org.apache.lucene.search.TopDocs; > import org.apache.lucene.util.LongValues; > import org.apache.solr.handler.component.FacetComponent.FacetContext; > import org.apache.solr.search.DocSet; > import org.apache.solr.search.DocSetUtil; > import org.apache.lucene.index.IndexReader; > public class TestLucene { > > private FacetContext fcontext; > private DocSet docs; > private IndexReader reader; > > > public void demo() throws IOException { > > DocSetUtil.collectSortedDocSet(docs, reader, new > SimpleCollector() { > @Override > public boolean needsScores() { return false; } > @Override > protected void doSetNextReader(LeafReaderContext ctx) throws > IOException { > // TODO > } > @Override > public void collect(int doc) throws IOException { > // TODO Auto-generated method stub > > } > }); > > 
} > } > {code} > The code should pass before, but it throws an error: > {code:java} > [ERROR] /TestLucene.java:[32,82] is not abstract and > does not override abstract method scoreMode() in > org.apache.lucene.search.Collector > [ERROR] /TestLucene.java:[36,19] method does not override or implement a > method from a supertype > {code} > I try to find changes in the migration >
[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014065#comment-17014065 ] Tomoko Uchida commented on LUCENE-9123: --- Hi [~h.kazuaki], introducing the option {{discardCompoundToken}} looks fine to me; however, I think we shouldn't change the signatures of the existing constructors, for backwards compatibility (they are public interfaces, so we have to keep them through 8.x anyway). Instead, we can add a new constructor. Opinions? > JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter > --- > > Key: LUCENE-9123 > URL: https://issues.apache.org/jira/browse/LUCENE-9123 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 8.4 >Reporter: Kazuaki Hiraga >Priority: Major > Attachments: LUCENE-9123.patch > > > JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with > both of SynonymGraphFilter and SynonymFilter when JT generates multiple > tokens as an output. If we use `mode=normal`, it should be fine. However, we > would like to use decomposed tokens that can maximize the chance to increase > recall. > Snippet of schema: > {code:xml} > positionIncrementGap="100" autoGeneratePhraseQueries="false"> > > > synonyms="lang/synonyms_ja.txt" > tokenizerFactory="solr.JapaneseTokenizerFactory"/> > > > tags="lang/stoptags_ja.txt" /> > > > > > > minimumLength="4"/> > > > > > {code} > A synonym entry that generates an error: > {noformat} > 株式会社,コーポレーション > {noformat} > The following is an output on console: > {noformat} > $ ./bin/solr create_core -c jp_test -d ../config/solrconfs > ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] > Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 > (got: 0) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
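The back-compat pattern suggested in the comment above - keep the existing constructor, add a new one carrying the extra {{discardCompoundToken}} flag, and have the old one delegate to the new one - can be sketched as follows. The class name TokenizerSketch is hypothetical; only the option name comes from the discussion, and the real JapaneseTokenizer constructors take several more arguments.

```java
// Hypothetical sketch of adding an option without breaking existing
// public constructor signatures.
class TokenizerSketch {
    final boolean discardCompoundToken;

    // existing public constructor: signature unchanged, default behavior kept
    TokenizerSketch() {
        this(false); // 8.x default: keep emitting compound tokens
    }

    // new constructor exposing the option
    TokenizerSketch(boolean discardCompoundToken) {
        this.discardCompoundToken = discardCompoundToken;
    }
}
```

Delegation keeps the default behavior in one place, so flipping the default on master later only means changing the value the old constructor passes.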
[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018278#comment-17018278 ] Tomoko Uchida commented on LUCENE-9123: --- I thought the change in behavior would have very small or no impact on users who use the Tokenizer for searching, but yes, it would affect users who use it for pure tokenization purposes. While keeping backward compatibility (within the same major version) is important, not emitting compound tokens would be preferred for getting along with succeeding token filters, and compound tokens are not needed for most use cases. I think it'd be better that we change the behavior at some point. How about this proposal: we can create two patches, one for master and one for 8x. On the 8x branch, add the new constructor so you can use it from the next update; there is no change in the default behavior. On the master branch, switch the default behavior (users who don't like the change can still switch back by using the full constructor). > JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter > --- > > Key: LUCENE-9123 > URL: https://issues.apache.org/jira/browse/LUCENE-9123 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 8.4 >Reporter: Kazuaki Hiraga >Priority: Major > Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch > > > JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with > both of SynonymGraphFilter and SynonymFilter when JT generates multiple > tokens as an output. If we use `mode=normal`, it should be fine. However, we > would like to use decomposed tokens that can maximize the chance to increase > recall. 
> Snippet of schema: > {code:xml} > positionIncrementGap="100" autoGeneratePhraseQueries="false"> > > > synonyms="lang/synonyms_ja.txt" > tokenizerFactory="solr.JapaneseTokenizerFactory"/> > > > tags="lang/stoptags_ja.txt" /> > > > > > > minimumLength="4"/> > > > > > {code} > An synonym entry that generates error: > {noformat} > 株式会社,コーポレーション > {noformat} > The following is an output on console: > {noformat} > $ ./bin/solr create_core -c jp_test -d ../config/solrconfs > ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] > Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 > (got: 0) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018504#comment-17018504 ] Tomoko Uchida commented on LUCENE-9123: --- {quote}OK. I will prepare another patch for the master branch. {quote} Thanks [~h.kazuaki]! Once the work is done, I can create the patch for the 8x branch by applying some modifications to yours, if preparing two patches feels like a burden. Also, we need to add some tests to {{TestJapaneseTokenizer}} and {{TestJapaneseTokenizerFactory}}. And by convention, the final patch for the master branch should be named "LUCENE-9123.patch", so can you please overwrite the obsolete patch instead of uploading new ones? {quote}Then, a person who is a maintainer of Japanese Tokenizer can choose how to merge the changes (who is responsible for Japanese Tokenizer for now?) {quote} I'm not sure if there is an explicit maintainer for each Lucene module; theoretically, every person who has write access to the ASF repo can commit any patch on their own responsibility. Let's wait for a few days, and I will commit the patch if there are no other comments or objections. [~cm] do you have any feedback about this change? > JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter > --- > > Key: LUCENE-9123 > URL: https://issues.apache.org/jira/browse/LUCENE-9123 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 8.4 >Reporter: Kazuaki Hiraga >Priority: Major > Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch > > > JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with > both of SynonymGraphFilter and SynonymFilter when JT generates multiple > tokens as an output. If we use `mode=normal`, it should be fine. However, we > would like to use decomposed tokens that can maximize the chance to increase > recall. 
> Snippet of schema: > {code:xml} > positionIncrementGap="100" autoGeneratePhraseQueries="false"> > > > synonyms="lang/synonyms_ja.txt" > tokenizerFactory="solr.JapaneseTokenizerFactory"/> > > > tags="lang/stoptags_ja.txt" /> > > > > > > minimumLength="4"/> > > > > > {code} > An synonym entry that generates error: > {noformat} > 株式会社,コーポレーション > {noformat} > The following is an output on console: > {noformat} > $ ./bin/solr create_core -c jp_test -d ../config/solrconfs > ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] > Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 > (got: 0) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018518#comment-17018518 ] Tomoko Uchida commented on LUCENE-9004: --- Hi [~jtibshirani], thanks for your comments/suggestions. I will check the links you mentioned. {quote}It also suggests that graph-based kNN is an active research area and that there are likely to be improvements + new approaches that come out. {quote} Yes, there are so many proposed methods and their variants in this field. Currently I'm not fully sure which is the most feasible approach for Lucene. Also, I just noticed an issue that proposes a product quantization based approach - roughly speaking, it may need less disk and memory space than graph-based methods like HNSW but incurs higher indexing and query-time costs: https://issues.apache.org/jira/browse/LUCENE-9136 > Approximate nearest vector search > - > > Key: LUCENE-9004 > URL: https://issues.apache.org/jira/browse/LUCENE-9004 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Michael Sokolov >Priority: Major > Attachments: hnsw_layered_graph.png > > Time Spent: 40m > Remaining Estimate: 0h > > "Semantic" search based on machine-learned vector "embeddings" representing > terms, queries and documents is becoming a must-have feature for a modern > search engine. SOLR-12890 is exploring various approaches to this, including > providing vector-based scoring functions. This is a spinoff issue from that. > The idea here is to explore approximate nearest-neighbor search. Researchers > have found an approach based on navigating a graph that partially encodes the > nearest neighbor relation at multiple scales can provide accuracy > 95% (as > compared to exact nearest neighbor calculations) at a reasonable cost. This > issue will explore implementing HNSW (hierarchical navigable small-world) > graphs for the purpose of approximate nearest vector search (often referred > to as KNN or k-nearest-neighbor search). 
> At a high level the way this algorithm works is this. First assume you have a > graph that has a partial encoding of the nearest neighbor relation, with some > short and some long-distance links. If this graph is built in the right way > (has the hierarchical navigable small world property), then you can > efficiently traverse it to find nearest neighbors (approximately) in log N > time where N is the number of nodes in the graph. I believe this idea was > pioneered in [1]. The great insight in that paper is that if you use the > graph search algorithm to find the K nearest neighbors of a new document > while indexing, and then link those neighbors (undirectedly, ie both ways) to > the new document, then the graph that emerges will have the desired > properties. > The implementation I propose for Lucene is as follows. We need two new data > structures to encode the vectors and the graph. We can encode vectors using a > light wrapper around {{BinaryDocValues}} (we also want to encode the vector > dimension and have efficient conversion from bytes to floats). For the graph > we can use {{SortedNumericDocValues}} where the values we encode are the > docids of the related documents. Encoding the interdocument relations using > docids directly will make it relatively fast to traverse the graph since we > won't need to look up through an id-field indirection. This choice limits us > to building a graph-per-segment since it would be impractical to maintain a > global graph for the whole index in the face of segment merges. However > graph-per-segment is a very natural fit at search time - we can traverse each > segments' graph independently and merge results as we do today for term-based > search. > At index time, however, merging graphs is somewhat challenging. While > indexing we build a graph incrementally, performing searches to construct > links among neighbors. When merging segments we must construct a new graph > containing elements of all the merged segments. 
Ideally we would somehow > preserve the work done when building the initial graphs, but at least as a > start I'd propose we construct a new graph from scratch when merging. The > process is going to be limited, at least initially, to graphs that can fit > in RAM since we require random access to the entire graph while constructing > it: In order to add links bidirectionally we must continually update existing > documents. > I think we want to express this API to users as a single joint > {{KnnGraphField}} abstraction that joins together the vectors and the graph > as a single joint field type. Mostly it just looks like a vector-valued > field, but has this graph attached to it. > I'll push a branch with my POC and would love to hear comments. It has many > nocommits,
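The greedy traversal described in the issue can be sketched with a self-contained toy: a single-layer proximity graph where search hops from an entry node to whichever neighbor is closest to the query, stopping at a local minimum. Real HNSW stacks several such layers and keeps a candidate beam rather than a single current node; the data and names below are illustrative, not Lucene's.

```java
// Greedy nearest-neighbor search on a single-layer proximity graph:
// from an entry node, repeatedly hop to the neighbor closest to the
// query until no neighbor improves on the current node. HNSW runs this
// kind of search on a hierarchy of such graphs, coarse to fine.
public class GreedyGraphSearch {
    static float dist(float[] a, float[] b) {
        float d = 0;
        for (int i = 0; i < a.length; i++) d += (a[i] - b[i]) * (a[i] - b[i]);
        return d;
    }

    // vectors[i] is node i's vector; neighbors[i] lists node i's links
    // (undirected, as the issue proposes linking both ways at insert time).
    static int search(float[][] vectors, int[][] neighbors, float[] query, int entry) {
        int current = entry;
        float best = dist(vectors[current], query);
        while (true) {
            int next = -1;
            for (int n : neighbors[current]) {
                float d = dist(vectors[n], query);
                if (d < best) { best = d; next = n; }
            }
            if (next == -1) return current;  // local minimum = approximate NN
            current = next;
        }
    }

    public static void main(String[] args) {
        float[][] vectors = {{0, 0}, {1, 0}, {2, 0}, {3, 0}};
        int[][] neighbors = {{1}, {0, 2}, {1, 3}, {2}};  // a simple chain
        // Starting at node 0, the search walks the chain toward the query.
        System.out.println(search(vectors, neighbors, new float[] {2.9f, 0f}, 0));
    }
}
```

Because each hop at least halves the remaining distance on a well-built graph, the number of hops grows roughly logarithmically with the node count, which is where the log N claim above comes from.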
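The product-quantization trade-off mentioned in the comment (less disk and memory, more approximation work) can be sketched as follows. The codebooks here are hand-picked toys rather than k-means-trained ones, and nothing below reflects the actual LUCENE-9136 patch.

```java
// Toy product quantization: split each vector into subvectors and replace
// each subvector with the id of its nearest codebook centroid. Real PQ
// learns codebooks with k-means; these are hand-picked for illustration.
public class ToyPQ {
    // Two codebooks (one per subvector half), each with 2 centroids of dim 2.
    static final float[][][] CODEBOOKS = {
        {{0f, 0f}, {1f, 1f}},   // codebook for dims 0..1
        {{0f, 1f}, {1f, 0f}},   // codebook for dims 2..3
    };

    // Encode a 4-dim vector as 2 centroid ids (a byte each in real PQ,
    // instead of 4 full floats - hence the space savings).
    static int[] encode(float[] v) {
        int[] codes = new int[CODEBOOKS.length];
        for (int m = 0; m < CODEBOOKS.length; m++) {
            float best = Float.MAX_VALUE;
            for (int c = 0; c < CODEBOOKS[m].length; c++) {
                float d = sq(v[2 * m] - CODEBOOKS[m][c][0])
                        + sq(v[2 * m + 1] - CODEBOOKS[m][c][1]);
                if (d < best) { best = d; codes[m] = c; }
            }
        }
        return codes;
    }

    // Approximate squared distance between a query and an encoded vector:
    // the sum of distances from the query subvectors to the chosen centroids.
    static float approxDistance(float[] q, int[] codes) {
        float d = 0;
        for (int m = 0; m < codes.length; m++) {
            float[] centroid = CODEBOOKS[m][codes[m]];
            d += sq(q[2 * m] - centroid[0]) + sq(q[2 * m + 1] - centroid[1]);
        }
        return d;
    }

    static float sq(float x) { return x * x; }

    public static void main(String[] args) {
        int[] codes = encode(new float[] {0.9f, 1.1f, 0.1f, 0.8f});
        System.out.println(codes[0] + " " + codes[1]);
        System.out.println(approxDistance(new float[] {1f, 1f, 0f, 1f}, codes));
    }
}
```

The extra indexing cost the comment refers to is the codebook training and per-vector encoding; the extra query cost is that distances are computed against reconstructed centroids rather than read off a prebuilt neighbor graph.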
[jira] [Assigned] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida reassigned LUCENE-9123: - Assignee: Tomoko Uchida > JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter > --- > > Key: LUCENE-9123 > URL: https://issues.apache.org/jira/browse/LUCENE-9123 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 8.4 >Reporter: Kazuaki Hiraga >Assignee: Tomoko Uchida >Priority: Major > Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch > > > JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with > either SynonymGraphFilter or SynonymFilter when JT generates multiple > tokens as an output. If we use `mode=normal`, it should be fine. However, we > would like to use decomposed tokens that can maximize the chance to increase > recall. > Snippet of schema: > {code:xml} > positionIncrementGap="100" autoGeneratePhraseQueries="false"> > > > synonyms="lang/synonyms_ja.txt" > tokenizerFactory="solr.JapaneseTokenizerFactory"/> > > > tags="lang/stoptags_ja.txt" /> > > > > > > minimumLength="4"/> > > > > > {code} > A synonym entry that generates the error: > {noformat} > 株式会社,コーポレーション > {noformat} > The following is the output on the console: > {noformat} > $ ./bin/solr create_core -c jp_test -d ../config/solrconfs > ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] > Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 > (got: 0) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9129) Updating from 7.X to 8.X breaks
[ https://issues.apache.org/jira/browse/LUCENE-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida resolved LUCENE-9129. --- Resolution: Not A Problem > Updating from 7.X to 8.X breaks > --- > > Key: LUCENE-9129 > URL: https://issues.apache.org/jira/browse/LUCENE-9129 > Project: Lucene - Core > Issue Type: Bug >Reporter: xia0c >Priority: Major > > Hi, while upgrading from 7.X to 8.X I found another breaking change. > {code:java} > import org.apache.lucene.analysis.Analyzer; > import org.apache.lucene.index.LeafReaderContext; > import org.apache.lucene.index.MultiDocValues; > import org.apache.lucene.index.SortedDocValues; > import org.apache.lucene.search.IndexSearcher; > import org.apache.lucene.search.Query; > import org.apache.lucene.search.SimpleCollector; > import org.apache.lucene.search.TopDocs; > import org.apache.lucene.util.LongValues; > import org.apache.solr.handler.component.FacetComponent.FacetContext; > import org.apache.solr.search.DocSet; > import org.apache.solr.search.DocSetUtil; > import org.apache.lucene.index.IndexReader; > public class TestLucene { > > private FacetContext fcontext; > private DocSet docs; > private IndexReader reader; > > > public void demo() throws IOException { > > DocSetUtil.collectSortedDocSet(docs, reader, new > SimpleCollector() { > @Override > public boolean needsScores() { return false; } > @Override > protected void doSetNextReader(LeafReaderContext ctx) throws > IOException { > // TODO > } > @Override > public void collect(int doc) throws IOException { > // TODO Auto-generated method stub > > } > }); > > } > } > {code} > The code compiled before, but now it throws an error: > {code:java} > [ERROR] /TestLucene.java:[32,82] is not abstract and > does not override abstract method scoreMode() in > org.apache.lucene.search.Collector > [ERROR] /TestLucene.java:[36,19] method does not override or implement a > method from a supertype > {code} > I tried to find the change in the migration 
> guide(https://github.com/apache/lucene-solr/blob/branch_8x/lucene/MIGRATE.txt) > but I didn't find it.
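The compile error itself points at the fix: in 8.x the boolean `Collector.needsScores()` was replaced by the abstract `scoreMode()` method, so anonymous `SimpleCollector` subclasses must override `scoreMode()` instead. The shape of the migration can be sketched with stand-in types; the enum and abstract class below are mocks mirroring the API change, not Lucene's own classes.

```java
// Stand-in types mirroring the 7.x -> 8.x Collector API change reported
// above; these are mock classes for illustration, not Lucene's own.
public class CollectorMigration {
    enum ScoreMode { COMPLETE, COMPLETE_NO_SCORES }

    // 8.x-style Collector: scoreMode() is abstract; the old boolean
    // needsScores() no longer exists, hence the two compile errors.
    abstract static class SimpleCollector {
        abstract ScoreMode scoreMode();
        abstract void collect(int doc);
    }

    // The migrated collector from the snippet: needsScores() returning
    // false becomes scoreMode() returning COMPLETE_NO_SCORES.
    static SimpleCollector migratedCollector() {
        return new SimpleCollector() {
            @Override
            ScoreMode scoreMode() { return ScoreMode.COMPLETE_NO_SCORES; }
            @Override
            void collect(int doc) { /* gather matching docids here */ }
        };
    }

    public static void main(String[] args) {
        System.out.println(migratedCollector().scoreMode());
    }
}
```

The richer enum (rather than a boolean) is what lets 8.x skip non-competitive hits when scores are not needed, which is why the method was replaced rather than renamed.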
[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018187#comment-17018187 ] Tomoko Uchida commented on LUCENE-9123: --- {quote} However, I don't think there are many situations that we need original tokens along with decompound ones {quote} Personally I agree with that. Concerning full text searching, we rarely need original tokens when we use the "search mode". Why don't we set "discardCompoundToken" to true by default from here (I think this minor change in behaviour is Okay for next 8.x release)?
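The position-increment error reported in this issue can be modeled in a few lines: the synonym-rule parser expects an analyzed rule to be a flat chain of tokens that each advance the position by exactly 1, while search mode also emits the compound token at an already-occupied position (position increment 0). The token model and check below are toys, not Lucene's TokenStream API.

```java
import java.util.Arrays;
import java.util.List;

public class SynonymRuleCheck {
    static final class Token {
        final String term;
        final int posInc;  // position increment relative to the previous token
        Token(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    // Mimics the synonym-rule parser's validation: every token in an
    // analyzed rule must advance the position by exactly 1.
    static String validate(List<Token> tokens) {
        for (Token t : tokens) {
            if (t.posInc != 1) {
                return "term analyzed to a token (" + t.term
                        + ") with position increment != 1 (got: " + t.posInc + ")";
            }
        }
        return "ok";
    }

    public static void main(String[] args) {
        // mode=normal: one token per position, accepted.
        System.out.println(validate(Arrays.asList(new Token("株式会社", 1))));
        // mode=search: a compound arrives at an already-occupied position
        // (posInc 0) next to its decompounded parts, and is rejected.
        System.out.println(validate(Arrays.asList(
                new Token("株式", 1), new Token("株式会社", 0), new Token("会社", 1))));
    }
}
```

This is also why the discardCompoundToken idea discussed in the comments resolves the problem: dropping the compound leaves only the posInc-1 chain of decompounded tokens.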
[jira] [Comment Edited] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018187#comment-17018187 ] Tomoko Uchida edited comment on LUCENE-9123 at 1/17/20 4:59 PM: {quote} However, I don't think there are many situations that we need original tokens along with decompound ones {quote} Personally I agree with that. Concerning full text searching, we rarely need original tokens when we use the "search mode". Why don't we set "discardCompoundToken" to true by default from here (I think this minor change in behaviour is Okay for next 8.x release)? was (Author: tomoko uchida): {{quote}} However, I don't think there are many situations that we need original tokens along with decompound ones {{quote}} Personally I agree with that. Concerning full text searching, we rarely need original tokens when we use the "search mode". Why don't we set "discardCompoundToken" to true by default from here (I think this minor change in behaviour is Okay for next 8.x release)?
[jira] [Commented] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012942#comment-17012942 ] Tomoko Uchida commented on LUCENE-9004: --- Hi, it seems that some devs are strongly interested in this issue and I have privately received feedback (and expectations). So I just wanted to share my latest WIP branch. [https://github.com/mocobeta/lucene-solr-mirror/commits/jira/LUCENE-9004-aknn-2] And a usage code snippet for it: [https://gist.github.com/mocobeta/a5b18506ebc933c0afa7ab61d1dd2295] I introduced a brand new codec and indexer for vector search so this no longer depends on DocValues, though it's still at a pretty early stage (especially, segment merging is not yet implemented). I intend to continue this work and I'll do my best, but to be honest I am not sure if my approach is the best, or whether I can create a great patch that can be merged to Lucene core... I'd welcome someone taking it over in different, more sophisticated/efficient ways. My current attempt might be useful as a reference or a starting point.
[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012942#comment-17012942 ] Tomoko Uchida edited comment on LUCENE-9004 at 1/10/20 3:02 PM: Hi, it seems that some devs are strongly interested in this issue and I have privately received feedback (and expectations). So I just wanted to share my latest WIP branch. [https://github.com/mocobeta/lucene-solr-mirror/commits/jira/LUCENE-9004-aknn-2] And here is a usage code snippet for it: [https://gist.github.com/mocobeta/a5b18506ebc933c0afa7ab61d1dd2295] I introduced a brand new codec and indexer for vector search so this no longer depends on DocValues, though it's still at a pretty early stage (especially, segment merging is not yet implemented). I intend to continue this work and I'll do my best, but to be honest I am not sure if my approach is the best, or whether I can create a great patch that can be merged to Lucene core... I'd welcome someone taking it over in different, more sophisticated/efficient ways. was (Author: tomoko uchida): Hi, it seems that some devs are strongly interested in this issue and I privately have received feedback (and expectations). So I just wanted to share my latest WIP branch. [https://github.com/mocobeta/lucene-solr-mirror/commits/jira/LUCENE-9004-aknn-2] And an usage code snippet for that is: [https://gist.github.com/mocobeta/a5b18506ebc933c0afa7ab61d1dd2295] I introduced a brand new codecs and indexer for vector search so this no longer depends on DocValues, though it's still on pretty early stage (especially, segment merging is not yet implemented). I intend to continue to work and I'll do my best, but to be honest I am not sure if my approach is the best - or I can create a great patch that can be merged to Lucene core... I welcome that someone takes over it in some different, more sophisticated/efficient ways. 
My current attempt might be useful as a reference or the starting point.
[jira] [Commented] (LUCENE-9119) Support (re-implement) "reconstruct" feature
[ https://issues.apache.org/jira/browse/LUCENE-9119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009764#comment-17009764 ] Tomoko Uchida commented on LUCENE-9119: --- The main reason they need "reconstruct" is: > Seeing the terms together in the order they went in, with position gaps helps > us understand why (say) a phrase query which spans multiple adjacent terms > didn’t match. > Support (re-implement) "reconstruct" feature > > > Key: LUCENE-9119 > URL: https://issues.apache.org/jira/browse/LUCENE-9119 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/luke >Reporter: Tomoko Uchida >Priority: Major > > I dropped the "reconstruct" feature from the Documents tab when porting > Thinlet to Swing. However, there is a strong request for supporting this > feature. > [https://github.com/DmitryKey/luke/pull/177] > I think it's easily possible and not harmful to just restore terms (with > their positions) and show them in a popup for a field.
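Restoring terms with their positions, gaps included, amounts to inverting the per-document term positions back into reading order. Here is a minimal sketch with toy input; Luke's real implementation would read these positions from term vectors or postings, and the gap marker is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of "reconstruct": given each term's positions within one
// document's field, invert them into position order, leaving a visible
// placeholder where a position is empty (e.g. a removed stopword). Tokens
// sharing a position would overwrite each other in this toy version.
public class ReconstructField {
    static List<String> reconstruct(Map<String, int[]> termPositions) {
        TreeMap<Integer, String> byPos = new TreeMap<>();
        for (Map.Entry<String, int[]> e : termPositions.entrySet())
            for (int pos : e.getValue())
                byPos.put(pos, e.getKey());
        List<String> out = new ArrayList<>();
        int expected = byPos.isEmpty() ? 0 : byPos.firstKey();
        for (Map.Entry<Integer, String> e : byPos.entrySet()) {
            for (; expected < e.getKey(); expected++) out.add("_");  // position gap
            out.add(e.getValue());
            expected = e.getKey() + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, int[]> positions = new HashMap<>();
        positions.put("quick", new int[] {0});
        positions.put("fox", new int[] {2});  // stopword removed at position 1
        System.out.println(reconstruct(positions));  // [quick, _, fox]
    }
}
```

The visible gap is exactly what the quoted comment asks for: it shows why a phrase query spanning the missing position would fail to match.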