[jira] [Commented] (LUCENE-8945) Allow to change the output file delimiter on Luke "export terms" feature

2019-09-18 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932147#comment-16932147
 ] 

Tomoko Uchida commented on LUCENE-8945:
---

+1

I will commit it to the ASF repo shortly.

> Allow to change the output file delimiter on Luke "export terms" feature
> 
>
> Key: LUCENE-8945
> URL: https://issues.apache.org/jira/browse/LUCENE-8945
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/luke
>Reporter: Tomoko Uchida
>Priority: Minor
> Attachments: LUCENE-8945.patch, LUCENE-8945.patch, 
> delimiter_comma_exported_file.PNG, delimiter_space_exported_file.PNG, 
> delimiter_tab_exported_file.PNG, luke_export_delimiter.png
>
>
> This is a follow-up issue for LUCENE-8764.
> The current delimiter is fixed to "," (comma), but terms can also include 
> commas, and they are not escaped. It would be better if the delimiter could 
> be changed/selected to a tab or whitespace when exporting.
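
To illustrate the ambiguity described above: with a fixed "," delimiter, an exported line for a term such as {{foo,bar}} cannot be parsed back reliably. A minimal sketch of a selectable delimiter (the {{Delimiter}} enum and export method here are illustrative, not Luke's actual classes):
{code:java}
import java.io.PrintWriter;
import java.util.Map;

public class TermsExportSketch {

  /** Hypothetical delimiter choices, mirroring the options proposed in this issue. */
  enum Delimiter {
    COMMA(","), TAB("\t"), WHITESPACE(" ");

    private final String separator;

    Delimiter(String separator) {
      this.separator = separator;
    }
  }

  /** Writes one "term<separator>docFreq" line per term. A term like "foo,bar"
   *  is ambiguous with COMMA but unambiguous with TAB or WHITESPACE for
   *  typical analyzed terms. */
  static void export(Map<String, Long> termCounts, Delimiter delimiter, PrintWriter out) {
    for (Map.Entry<String, Long> entry : termCounts.entrySet()) {
      out.println(entry.getKey() + delimiter.separator + entry.getValue());
    }
  }
}
{code}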






[jira] [Updated] (LUCENE-8945) Allow to change the output file delimiter on Luke "export terms" feature

2019-09-18 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-8945:
--
Attachment: LUCENE-8945-final.patch

> Allow to change the output file delimiter on Luke "export terms" feature
> 
>
> Key: LUCENE-8945
> URL: https://issues.apache.org/jira/browse/LUCENE-8945
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/luke
>Reporter: Tomoko Uchida
>Priority: Minor
> Attachments: LUCENE-8945-final.patch, LUCENE-8945.patch, 
> LUCENE-8945.patch, delimiter_comma_exported_file.PNG, 
> delimiter_space_exported_file.PNG, delimiter_tab_exported_file.PNG, 
> luke_export_delimiter.png
>
>
> This is a follow-up issue for LUCENE-8764.
> The current delimiter is fixed to "," (comma), but terms can also include 
> commas, and they are not escaped. It would be better if the delimiter could 
> be changed/selected to a tab or whitespace when exporting.






[jira] [Commented] (LUCENE-8945) Allow to change the output file delimiter on Luke "export terms" feature

2019-09-18 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932320#comment-16932320
 ] 

Tomoko Uchida commented on LUCENE-8945:
---

It seems the ASF bot is not working...

Committed to master and 8x, with a slight modification (moved Delimiter to a 
private enum; it's used only in the factory anyway). Here is the final patch: 
[^LUCENE-8945-final.patch]

[https://github.com/apache/lucene-solr/commit/369df12c2cc54e929bd25dd77424242ddd0fb047]

Thanks [~shahamish150294]!

> Allow to change the output file delimiter on Luke "export terms" feature
> 
>
> Key: LUCENE-8945
> URL: https://issues.apache.org/jira/browse/LUCENE-8945
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/luke
>Reporter: Tomoko Uchida
>Priority: Minor
> Attachments: LUCENE-8945-final.patch, LUCENE-8945.patch, 
> LUCENE-8945.patch, delimiter_comma_exported_file.PNG, 
> delimiter_space_exported_file.PNG, delimiter_tab_exported_file.PNG, 
> luke_export_delimiter.png
>
>
> This is a follow-up issue for LUCENE-8764.
> The current delimiter is fixed to "," (comma), but terms can also include 
> commas, and they are not escaped. It would be better if the delimiter could 
> be changed/selected to a tab or whitespace when exporting.






[jira] [Resolved] (LUCENE-8945) Allow to change the output file delimiter on Luke "export terms" feature

2019-09-18 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida resolved LUCENE-8945.
---
Fix Version/s: 8.3
   master (9.0)
 Assignee: Tomoko Uchida
   Resolution: Fixed

> Allow to change the output file delimiter on Luke "export terms" feature
> 
>
> Key: LUCENE-8945
> URL: https://issues.apache.org/jira/browse/LUCENE-8945
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/luke
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Minor
> Fix For: master (9.0), 8.3
>
> Attachments: LUCENE-8945-final.patch, LUCENE-8945.patch, 
> LUCENE-8945.patch, delimiter_comma_exported_file.PNG, 
> delimiter_space_exported_file.PNG, delimiter_tab_exported_file.PNG, 
> luke_export_delimiter.png
>
>
> This is a follow-up issue for LUCENE-8764.
> The current delimiter is fixed to "," (comma), but terms can also include 
> commas, and they are not escaped. It would be better if the delimiter could 
> be changed/selected to a tab or whitespace when exporting.






[jira] [Commented] (LUCENE-9004) Approximate nearest vector search

2019-11-18 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976542#comment-16976542
 ] 

Tomoko Uchida commented on LUCENE-9004:
---

Thanks for mentioning it. I have been working on this issue for a couple of 
weeks, and here is my WIP/PoC branch (it's not a PR yet, because the "Query" 
part is still missing).
 [https://github.com/mocobeta/lucene-solr-mirror/commits/jira/LUCENE-9004-aknn]

I borrowed [~sokolov]'s idea but took a different implementation approach:
 - Introduce a new codec (Format, Writer, and Reader) for the graph part. The 
new {{GraphFormat}} can express a multi-level (document) graph.
 - Introduce a new doc values field type for the vector part. The new 
{{VectorDocValues}} shares the same codec as BinaryDocValues but provides 
special functionality for dense vector handling: encoding/decoding a float 
array to/from a binary value, keeping the number of dimensions and the distance 
function, and allowing random access to the underlying binary doc values (see 
the encoding sketch below). (For now I just reset IndexedDISI when seeking 
backwards.)

It works, but there are indexing performance concerns (due to costly graph 
construction). Anyway, I hope I can create a PR with working examples before 
long...
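
As a rough sketch of the encode/decode step mentioned above (illustrative only; the actual {{VectorDocValues}} implementation on the branch may differ), a float array can be packed into the fixed-length binary form a doc values field stores:
{code:java}
import java.nio.ByteBuffer;
import org.apache.lucene.util.BytesRef;

public final class VectorEncodingSketch {

  /** Encodes a float vector into a fixed-length binary value (4 bytes per dimension). */
  public static BytesRef encode(float[] vector) {
    ByteBuffer buffer = ByteBuffer.allocate(vector.length * Float.BYTES);
    for (float v : vector) {
      buffer.putFloat(v);
    }
    return new BytesRef(buffer.array());
  }

  /** Decodes the binary value back into a float vector of the given dimension. */
  public static float[] decode(BytesRef value, int numDimensions) {
    float[] vector = new float[numDimensions];
    ByteBuffer.wrap(value.bytes, value.offset, value.length).asFloatBuffer().get(vector);
    return vector;
  }
}
{code}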

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to look up through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However 
> graph-per-segment is very natural at search time - we can traverse each 
> segments' graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be  limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: In order to add links bidirectionally we must continually update existing 
> documents.
> I think we want to express this API to users as a single joint 
> 

[jira] [Commented] (LUCENE-9004) Approximate nearest vector search

2019-11-23 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980772#comment-16980772
 ] 

Tomoko Uchida commented on LUCENE-9004:
---

Just a status update:
 [my PoC 
branch|https://github.com/mocobeta/lucene-solr-mirror/tree/jira/LUCENE-9004-aknn]
 is still at a pretty early stage and works on only one segment, but it can now 
index and query arbitrary vectors via [this example 
code|https://gist.github.com/mocobeta/5c174ee9fc6408470057a9e7d2020c45]. The 
newly added KnnGraphQuery is an extension of the Query class, so it should be 
combinable with other queries, with some limitations, because the knn query 
cannot score the entire dataset by nature. Indexing performance is terrible for 
now (it takes a few minutes for hundreds of thousands of vectors with 100 
dimensions on a commodity PC), but searching doesn't look too bad (~30 msec for 
the same dataset) thanks to the skip-list-like graph structure.
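
For a sense of how such a query could be combined with others (a hypothetical sketch; the actual {{KnnGraphQuery}} constructor on the branch may differ), one could add a conventional filter clause around the knn clause:
{code:java}
import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class KnnQuerySketch {

  /** Approximate nearest neighbors of queryVector, restricted to one category. */
  static TopDocs searchNearest(IndexSearcher searcher, float[] queryVector) throws IOException {
    // Hypothetical constructor: field name, query vector, number of neighbors.
    Query knn = new KnnGraphQuery("vector", queryVector, 10);
    Query filtered = new BooleanQuery.Builder()
        .add(knn, BooleanClause.Occur.MUST)                // scores come from the knn clause
        .add(new TermQuery(new Term("category", "books")),
             BooleanClause.Occur.FILTER)                   // filter does not affect scores
        .build();
    return searcher.search(filtered, 10);
  }
}
{code}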

On my current branch I wrapped {{BinaryDocValues}} to store vector values. 
However, exposing random access capability on doc values (or their extensions) 
could be controversial, so I'd like to propose a new codec which combines 1. the 
HNSW graph and 2. the vectors (float arrays).

The new format for each vector field would have three parts (in other words, 
three files in a segment). They would look like:
{code:java}
 Meta data and index part:
 +--------+-----------------------------------------+
 | meta data                                        |
 +--------+-----------------------------------------+
 | doc id | offset to first friend list for the doc |
 +--------+-----------------------------------------+
 | doc id | offset to first friend list for the doc |
 +--------+-----------------------------------------+
 |   ..   |                   ..                    |
 +--------+-----------------------------------------+

 Graph data part:
 +-------------------------+---------------------------+----+-------------------------+
 | friends list at layer N | friends list at layer N-1 | .. | friends list at level 0 |
 +-------------------------+---------------------------+----+-------------------------+
 | friends list at layer N | friends list at layer N-1 | .. | friends list at level 0 |
 +-------------------------+---------------------------+----+-------------------------+
 |                                         ..                                         |
 +------------------------------------------------------------------------------------+

 Vector data part:
 +----------------------+
 | encoded vector value |
 +----------------------+
 | encoded vector value |
 +----------------------+
 |          ..          |
 +----------------------+
{code}
 - "meta data" includes: number of dimensions, distance function for similarity 
calculation, and other field level meta data
 - "doc id" is: doc ids having a vector value on this field
 - "friends list at layer N" is: a delta encoded target doc id list where each 
target doc is connected to the doc at Nth layer
 - "encoded vector value" is: a fixed length byte array. the offset of the 
value can be calculated on the fly. (limitations: each document can have only 
one vector value for each vector field)
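
A sketch of the delta encoding for a friends list (assuming a sorted doc id list and Lucene's {{DataOutput}}/{{DataInput}}; this is not the branch's actual writer):
{code:java}
import java.io.IOException;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.DataOutput;

public final class FriendsListCodecSketch {

  /** Writes a sorted doc id list as gaps, so nearby neighbors encode in one VInt byte. */
  static void writeFriendsList(int[] sortedDocIds, DataOutput out) throws IOException {
    out.writeVInt(sortedDocIds.length);
    int previous = 0;
    for (int docId : sortedDocIds) {
      out.writeVInt(docId - previous); // store the gap, not the absolute id
      previous = docId;
    }
  }

  /** Reads the list back by accumulating the gaps. */
  static int[] readFriendsList(DataInput in) throws IOException {
    int length = in.readVInt();
    int[] docIds = new int[length];
    int docId = 0;
    for (int i = 0; i < length; i++) {
      docId += in.readVInt();
      docIds[i] = docId;
    }
    return docIds;
  }
}
{code}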

The graph data (friends lists) is relatively small, so we could keep all of it 
on the Java heap for fast retrieval (though some off-heap strategy might be 
required for very large graphs).
 The vector data (vector values) is large, and only a small fraction of it is 
needed when searching, so it should be accessed on demand via the index, as 
sketched below.
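
Because the vector values are fixed length, the on-demand access could be a simple seek into the vector data file; a sketch assuming one {{IndexInput}} over that file (the names here are illustrative):
{code:java}
import java.io.IOException;
import org.apache.lucene.store.IndexInput;

public final class VectorReaderSketch {

  /** Random-access read of the vector for the ord-th document that has a value. */
  static float[] readVector(IndexInput vectorData, int ord, int numDimensions) throws IOException {
    long offset = (long) ord * numDimensions * Float.BYTES; // fixed-length records
    vectorData.seek(offset);
    float[] vector = new float[numDimensions];
    for (int i = 0; i < numDimensions; i++) {
      vector[i] = Float.intBitsToFloat(vectorData.readInt());
    }
    return vector;
  }
}
{code}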

Feedback is welcome.

And I have a question about introducing new formats - is there a way to inject 
an XXXFormat into the indexing chain so that we can add this feature without 
any change to {{lucene-core}}?
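
On the question above, one existing hook that needs no {{lucene-core}} change is to extend {{FilterCodec}}, override the format accessor, and register the codec via SPI; a sketch, where {{VectorDocValuesFormat}} stands in for the hypothetical new format:
{code:java}
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.DocValuesFormat;
import org.apache.lucene.codecs.FilterCodec;

/**
 * Delegates everything to the default codec but swaps in a custom doc values
 * format (where the vector/graph data could live). Register the class name in
 * META-INF/services/org.apache.lucene.codecs.Codec so the codec can be found
 * by name when the index is read, and select it at write time with
 * IndexWriterConfig#setCodec.
 */
public class VectorGraphCodec extends FilterCodec {

  private final DocValuesFormat docValuesFormat = new VectorDocValuesFormat(); // hypothetical

  public VectorGraphCodec() {
    super("VectorGraphCodec", Codec.getDefault());
  }

  @Override
  public DocValuesFormat docValuesFormat() {
    return docValuesFormat;
  }
}
{code}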

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable 

[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search

2019-11-23 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980772#comment-16980772
 ] 

Tomoko Uchida edited comment on LUCENE-9004 at 11/23/19 4:23 PM:
-

Just a status update:
 [my PoC 
branch|https://github.com/mocobeta/lucene-solr-mirror/tree/jira/LUCENE-9004-aknn]
 is still at a pretty early stage and works on only one segment, but it can now 
index and query arbitrary vectors via [this example 
code|https://gist.github.com/mocobeta/5c174ee9fc6408470057a9e7d2020c45]. The 
newly added KnnGraphQuery is an extension of the Query class, so it should be 
combinable with other queries, with some limitations, because the knn query 
cannot score the entire dataset by nature. Indexing performance is terrible for 
now (it takes a few minutes for hundreds of thousands of vectors with 100 
dimensions on a commodity PC), but searching doesn't look too bad (~30 msec for 
the same dataset) thanks to the skip-list-like graph structure.

On my current branch I wrapped {{BinaryDocValues}} to store vector values. 
However, exposing random access capability on doc values (or their extensions) 
could be controversial, so I'd like to propose a new codec which combines 1. the 
HNSW graph and 2. the vectors (float arrays).

The new format for each vector field would have three parts (in other words, 
three files in a segment). They would look like:
{code:java}
 Meta data and index part:
 +--------+-----------------------------------------+
 | meta data                                        |
 +--------+-----------------------------------------+
 | doc id | offset to first friend list for the doc |
 +--------+-----------------------------------------+
 | doc id | offset to first friend list for the doc |
 +--------+-----------------------------------------+
 |   ..   |                   ..                    |
 +--------+-----------------------------------------+

 Graph data part:
 +-------------------------+---------------------------+----+-------------------------+
 | friends list at layer N | friends list at layer N-1 | .. | friends list at level 0 | <- friends lists for doc 0
 +-------------------------+---------------------------+----+-------------------------+
 | friends list at layer N | friends list at layer N-1 | .. | friends list at level 0 | <- friends lists for doc 1
 +-------------------------+---------------------------+----+-------------------------+
 |                                         ..                                         | <- and so on
 +------------------------------------------------------------------------------------+

 Vector data part:
 +----------------------+
 | encoded vector value | <- vector value for doc 0
 +----------------------+
 | encoded vector value | <- vector value for doc 1
 +----------------------+
 |          ..          | <- and so on
 +----------------------+
{code}
 - "meta data" includes: number of dimensions, distance function for similarity 
calculation, and other field level meta data
 - "doc id" is: doc ids having a vector value on this field
 - "friends list at layer N" is: a delta encoded target doc id list where each 
target doc is connected to the doc at Nth layer
 - "encoded vector value" is: a fixed length byte array. the offset of the 
value can be calculated on the fly. (limitations: each document can have only 
one vector value for each vector field)

The graph data (friends lists) is relatively small, so we could keep all of it 
on the Java heap for fast retrieval (though some off-heap strategy might be 
required for very large graphs).
 The vector data (vector values) is large, and only a small fraction of it is 
needed when searching, so it should be kept on disk and accessed in some 
on-demand style.

Feedback is welcome.

And I have a question about introducing new formats - is there a way to inject 
an XXXFormat into the indexing chain so that we can add this feature without 
any change to {{lucene-core}}?


was (Author: tomoko uchida):
Just a status update:
 [my PoC 
branch|https://github.com/mocobeta/lucene-solr-mirror/tree/jira/LUCENE-9004-aknn]
 is still at a pretty early stage and works on only one segment, but it can now 
index and query arbitrary vectors via [this example 
code|https://gist.github.com/mocobeta/5c174ee9fc6408470057a9e7d2020c45]. The 
newly added KnnGraphQuery is an extension of the Query class, so it should be 
combinable with other queries, with some limitations, because the knn query 
cannot score the entire dataset by nature. Indexing performance is terrible for 
now (it takes a few minutes for hundreds of thousands of vectors with 100 
dimensions on a commodity PC), but searching doesn't look too bad (~30 msec for 
the same dataset) thanks to the skip-list-like graph structure.

On my current branch I wrapped {{BinaryDocValues}} to store vector 

[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search

2019-11-23 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980772#comment-16980772
 ] 

Tomoko Uchida edited comment on LUCENE-9004 at 11/23/19 3:52 PM:
-

Just a status update:
 [my PoC 
branch|https://github.com/mocobeta/lucene-solr-mirror/tree/jira/LUCENE-9004-aknn]
 is still at a pretty early stage and works on only one segment, but it can now 
index and query arbitrary vectors via [this example 
code|https://gist.github.com/mocobeta/5c174ee9fc6408470057a9e7d2020c45]. The 
newly added KnnGraphQuery is an extension of the Query class, so it should be 
combinable with other queries, with some limitations, because the knn query 
cannot score the entire dataset by nature. Indexing performance is terrible for 
now (it takes a few minutes for hundreds of thousands of vectors with 100 
dimensions on a commodity PC), but searching doesn't look too bad (~30 msec for 
the same dataset) thanks to the skip-list-like graph structure.

On my current branch I wrapped {{BinaryDocValues}} to store vector values. 
However, exposing random access capability on doc values (or their extensions) 
could be controversial, so I'd like to propose a new codec which combines 1. the 
HNSW graph and 2. the vectors (float arrays).

The new format for each vector field would have three parts (in other words, 
three files in a segment). They would look like:
{code:java}
 Meta data and index part:
 +--------+-----------------------------------------+
 | meta data                                        |
 +--------+-----------------------------------------+
 | doc id | offset to first friend list for the doc |
 +--------+-----------------------------------------+
 | doc id | offset to first friend list for the doc |
 +--------+-----------------------------------------+
 |   ..   |                   ..                    |
 +--------+-----------------------------------------+

 Graph data part:
 +-------------------------+---------------------------+----+-------------------------+
 | friends list at layer N | friends list at layer N-1 | .. | friends list at level 0 | <- friends lists for doc 0
 +-------------------------+---------------------------+----+-------------------------+
 | friends list at layer N | friends list at layer N-1 | .. | friends list at level 0 | <- friends lists for doc 1
 +-------------------------+---------------------------+----+-------------------------+
 |                                         ..                                         | <- and so on
 +------------------------------------------------------------------------------------+

 Vector data part:
 +----------------------+
 | encoded vector value | <- vector value for doc 0
 +----------------------+
 | encoded vector value | <- vector value for doc 1
 +----------------------+
 |          ..          | <- and so on
 +----------------------+
{code}
 - "meta data" includes: number of dimensions, distance function for similarity 
calculation, and other field level meta data
 - "doc id" is: doc ids having a vector value on this field
 - "friends list at layer N" is: a delta encoded target doc id list where each 
target doc is connected to the doc at Nth layer
 - "encoded vector value" is: a fixed length byte array. the offset of the 
value can be calculated on the fly. (limitations: each document can have only 
one vector value for each vector field)

The graph data (friends lists) is relatively small, so we could keep all of it 
on the Java heap for fast retrieval (though some off-heap strategy might be 
required for very large graphs).
 The vector data (vector values) is large, and only a small fraction of it is 
needed when searching, so it should be accessed on demand via the index.

Feedback is welcome.

And I have a question about introducing new formats - is there a way to inject 
an XXXFormat into the indexing chain so that we can add this feature without 
any change to {{lucene-core}}?


was (Author: tomoko uchida):
Just a status update:
 [my PoC 
branch|https://github.com/mocobeta/lucene-solr-mirror/tree/jira/LUCENE-9004-aknn]
 is still at a pretty early stage and works on only one segment, but it can now 
index and query arbitrary vectors via [this example 
code|https://gist.github.com/mocobeta/5c174ee9fc6408470057a9e7d2020c45]. The 
newly added KnnGraphQuery is an extension of the Query class, so it should be 
combinable with other queries, with some limitations, because the knn query 
cannot score the entire dataset by nature. Indexing performance is terrible for 
now (it takes a few minutes for hundreds of thousands of vectors with 100 
dimensions on a commodity PC), but searching doesn't look too bad (~30 msec for 
the same dataset) thanks to the skip-list-like graph structure.

On my current branch I wrapped {{BinaryDocValues}} to store vector values. 

[jira] [Updated] (LUCENE-9004) Approximate nearest vector search

2019-10-19 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-9004:
--
Attachment: hnsw_layered_graph.png
Status: Open  (was: Open)

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to look up through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However 
> graph-per-segment is very natural at search time - we can traverse each 
> segments' graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be  limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: In order to add links bidirectionally we must continually update existing 
> documents.
> I think we want to express this API to users as a single joint 
> {{KnnGraphField}} abstraction that joins together the vectors and the graph 
> as a single joint field type. Mostly it just looks like a vector-valued 
> field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many 
> nocommits, basic design is not really set, there is no Query implementation 
> and no integration with IndexSearcher, but it does work by some measure using 
> a standalone test class. I've tested with uniform random vectors and on my 
> laptop indexed 10K documents in around 10 seconds and searched them at 95% 
> recall (compared with exact nearest-neighbor baseline) at around 250 QPS. I 
> haven't made any attempt to use multithreaded search for this, but it is 
> amenable to per-segment concurrency.
> [1] 
> https://www.semanticscholar.org/paper/Efficient-and-robust-approximate-nearest-neighbor-Malkov-Yashunin/699a2e3b653c69aff5cf7a9923793b974f8ca164




[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search

2019-10-19 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955354#comment-16955354
 ] 

Tomoko Uchida edited comment on LUCENE-9004 at 10/20/19 2:13 AM:
-

Hi,
 I've been trying to understand the Hierarchical NSW paper, its previous work, 
the NSW model 
([https://publications.hse.ru/mirror/pubs/share/folder/x5p6h7thif/direct/128296059]),
 and the PoC implementation.

Just to clarify the discussion (and my understanding) here, I'd like to leave 
my comments about data structures/encodings.
 - We need two data structures, as Michael Sokolov described: 1) the vector 
(value tied to a vertex) and 2) the layered undirectional graph (I think the 
figure below helps our understanding).
 - 1) the vector (float array) can simply be represented by {{BinaryDocValues}} 
as in the current PoC. For more efficient data access on large indexes, it may 
be necessary to introduce new Formats.
 - Indeed we need random access - it's not related to 1) the vector (the value) 
representation itself, but it is required for 2), to traverse/search the 
"undirectional" graph.
 - Additionally, we need a skip-list-like structure to encode the hierarchical 
graph (still not implemented).

!hnsw_layered_graph.png!

My feeling is that at least we need a new Format to represent 2), the layered 
undirectional graph. The current PoC branch encodes Layer 0 with 
{{SortedNumericDocValues}}; however, since we eventually intend to introduce 
multiple layers, it wouldn't be possible to represent those with existing doc 
values. (Please correct me if this is not true :) )

Or would it be possible to introduce a new, dedicated auxiliary data structure 
/ algorithm for HNSW apart from postings lists, like FST? I mean, for layered 
undirectional graph construction/traversal, we could have an o.a.l.util.hnsw 
package. It's just an idea and I'm now attempting to delve into that... 
[~sokolov] have you considered it so far?
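
For reference, the layer-by-layer traversal discussed here boils down to a greedy descent over the graph; a simplified sketch (the {{HnswGraph}} accessors are hypothetical, and real HNSW keeps a candidate beam rather than a single node):
{code:java}
public class GreedySearchSketch {

  /** Hypothetical accessors for the layered graph and the vectors. */
  interface HnswGraph {
    int[] friends(int node, int layer);      // adjacency at the given layer
    float distance(float[] query, int node); // distance from query to the node's vector
  }

  /** Starts at the entry point on the top layer, greedily moves to the closest
   *  neighbor until no neighbor improves, then drops one layer and repeats. */
  static int greedySearch(HnswGraph graph, float[] query, int entryNode, int topLayer) {
    int current = entryNode;
    for (int layer = topLayer; layer >= 0; layer--) {
      boolean improved = true;
      while (improved) {
        improved = false;
        for (int neighbor : graph.friends(current, layer)) {
          if (graph.distance(query, neighbor) < graph.distance(query, current)) {
            current = neighbor; // move toward the query
            improved = true;
          }
        }
      }
    }
    return current; // approximate nearest neighbor found at layer 0
  }
}
{code}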


was (Author: tomoko uchida):
Hi,
I've been trying to understand the Hierarchical NSW paper, its previous work, 
the NSW model 
(https://publications.hse.ru/mirror/pubs/share/folder/x5p6h7thif/direct/128296059),
 and the PoC implementation.

Just to clarify the discussion (and my understanding) here, I'd like to leave 
my comments about data structures/encodings.

- We need two data structures, as Michael Sokolov described: 1) the vector 
(value tied to a vertex) and 2) the layered undirectional graph (I think the 
figure below helps our understanding).
- 1) the vector (float array) can simply be represented by {{BinaryDocValues}} 
as in the current PoC. For more efficient data access on large indexes, it may 
be necessary to introduce new Formats.
- Indeed we need random access - it's not related to 1) the vector (the value) 
representation itself, but it is required for 2), to traverse/search the 
"undirectional" graph.
- Additionally, we need a skip-list-like structure to encode the graph 
hierarchy (still not implemented).

 !hnsw_layered_graph.png! 

My feeling is that at least we need a new Format to represent 2), the layered 
undirectional graph. The current PoC branch encodes Layer 0 with 
{{SortedNumericDocValues}}; however, since we eventually intend to introduce 
multiple layers, it wouldn't be possible to represent those with existing doc 
values. (Please correct me if this is not true :) )

Or would it be possible to introduce a new, dedicated auxiliary data structure 
/ algorithm for HNSW apart from postings lists, like FST? I mean, for layered 
undirectional graph construction/traversal, we could have an o.a.l.util.hnsw 
package. It's just an idea and I'm now attempting to delve into that... 
[~sokolov] have you considered it so far?

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level 

[jira] [Commented] (LUCENE-9004) Approximate nearest vector search

2019-10-19 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955396#comment-16955396
 ] 

Tomoko Uchida commented on LUCENE-9004:
---

{quote}Indeed we need random access - it's not related to 1) the vector (the 
value) representation itself but required for 2), to traverse/search the 
"undirectional" graph.
{quote}
Ah, sorry, that is not correct: we would need random access for both the 
vector and the graph... Please ignore that line.

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to look up through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However 
> graph-per-segment is very natural at search time - we can traverse each 
> segments' graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be  limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: In order to add links bidirectionally we must continually update existing 
> documents.
> I think we want to express this API to users as a single joint 
> {{KnnGraphField}} abstraction that joins together the vectors and the graph 
> as a single joint field type. Mostly it just looks like a vector-valued 
> field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many 
> nocommits, basic design is not really set, there is no Query implementation 
> and no integration with IndexSearcher, but it does work by some measure using 
> a standalone test class. I've tested with uniform random vectors and on my 
> laptop indexed 10K documents in around 10 seconds and searched them at 95% 
> recall (compared with exact nearest-neighbor baseline) at around 250 QPS. I 
> haven't made any attempt to use multithreaded 

[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search

2019-10-19 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955354#comment-16955354
 ] 

Tomoko Uchida edited comment on LUCENE-9004 at 10/20/19 5:29 AM:
-

Hi,
 I've been trying to understand the Hierarchical NSW paper, its previous work, 
the NSW model 
([https://publications.hse.ru/mirror/pubs/share/folder/x5p6h7thif/direct/128296059]),
 and the PoC implementation.

Just to clarify the discussion (and my understanding) here, I'd like to leave 
my comments about data structures/encodings.
 - We need two data structures, as Michael Sokolov described: 1) the vector 
(value tied to a vertex) and 2) the layered undirectional graph (I think the 
figure below helps our understanding).
 - 1) the vector (float array) can simply be represented by {{BinaryDocValues}} 
as in the current PoC. For more efficient data access on large indexes, it may 
be necessary to introduce new Formats.
 - -Indeed we need random access - it's not related to 1) the vector (the 
value) representation itself, but it is required for 2), to traverse/search the 
"undirectional" graph.-
 - Additionally, we need a skip-list-like structure to encode the hierarchical 
graph (still not implemented).

!hnsw_layered_graph.png!

My feeling is that at least we need a new Format to represent 2), the layered 
undirectional graph. The current PoC branch encodes Layer 0 with 
{{SortedNumericDocValues}}; however, since we eventually intend to introduce 
multiple layers, it wouldn't be possible to represent those with existing doc 
values. (Please correct me if this is not true :) )

Or would it be possible to introduce a new, dedicated auxiliary data structure 
/ algorithm for HNSW apart from postings lists, like FST? I mean, for layered 
undirectional graph construction/traversal, we could have an o.a.l.util.hnsw 
package. It's just an idea and I'm now attempting to delve into that... 
[~sokolov] have you considered it so far?


was (Author: tomoko uchida):
Hi,
 I've been trying to understand the Hierarchical NSW paper, its previous work, 
the NSW model 
([https://publications.hse.ru/mirror/pubs/share/folder/x5p6h7thif/direct/128296059]),
 and the PoC implementation.

Just to clarify the discussion (and my understanding) here, I'd like to leave 
my comments about data structures/encodings.
 - We need two data structures, as Michael Sokolov described: 1) the vector 
(value tied to a vertex) and 2) the layered undirectional graph (I think the 
figure below helps our understanding).
 - 1) the vector (float array) can simply be represented by {{BinaryDocValues}} 
as in the current PoC. For more efficient data access on large indexes, it may 
be necessary to introduce new Formats.
 - Indeed we need random access - it's not related to 1) the vector (the value) 
representation itself, but it is required for 2), to traverse/search the 
"undirectional" graph.
 - Additionally, we need a skip-list-like structure to encode the hierarchical 
graph (still not implemented).

!hnsw_layered_graph.png!

My feeling is that at least we need a new Format to represent 2), the layered 
undirectional graph. The current PoC branch encodes Layer 0 with 
{{SortedNumericDocValues}}; however, since we eventually intend to introduce 
multiple layers, it wouldn't be possible to represent those with existing doc 
values. (Please correct me if this is not true :) )

Or would it be possible to introduce a new, dedicated auxiliary data structure 
/ algorithm for HNSW apart from postings lists, like FST? I mean, for layered 
undirectional graph construction/traversal, we could have an o.a.l.util.hnsw 
package. It's just an idea and I'm now attempting to delve into that... 
[~sokolov] have you considered it so far?

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high 

[jira] [Resolved] (LUCENE-8998) OverviewImplTest.testIsOptimized reproducible failure

2019-10-05 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida resolved LUCENE-8998.
---
Fix Version/s: 8.3
   master (9.0)
   Resolution: Fixed

Thank you [~hossman] for reporting.

> OverviewImplTest.testIsOptimized reproducible failure
> -
>
> Key: LUCENE-8998
> URL: https://issues.apache.org/jira/browse/LUCENE-8998
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: luke
>Reporter: Chris M. Hostetter
>Assignee: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0), 8.3
>
> Attachments: LUCENE-8998.patch
>
>
> The following seed reproduces reliably for me on master...
> (NOTE: the {{ERROR StatusLogger}} messages include the one about the 
> AccessControlException occur even with other seeds when the test passes)
> {noformat}
> [mkdir] Created dir: /home/hossman/lucene/alt_dev/lucene/build/luke/test
> [junit4:pickseed] Seed property 'tests.seed' already defined: 9123DD19C50D658
> [mkdir] Created dir: 
> /home/hossman/lucene/alt_dev/lucene/build/luke/test/temp
>[junit4] <JUnit4> says cześć! Master seed: 9123DD19C50D658
>[junit4] Executing 1 suite with 1 JVM.
>[junit4] 
>[junit4] Started J0 PID(8576@localhost).
>[junit4] Suite: org.apache.lucene.luke.models.overview.OverviewImplTest
>[junit4]   2> ERROR StatusLogger No Log4j 2 configuration file found. 
> Using default configuration (logging only errors to the console), or user 
> programmatically provided configurations. Set system property 'log4j2.debug' 
> to show Log4j 2 internal initialization logging. See 
> https://logging.apache.org/log4j/2.x/manual/configuration.html for 
> instructions on how to configure Log4j 2
>[junit4]   2> ERROR StatusLogger Could not reconfigure JMX
>[junit4]   2>  java.security.AccessControlException: access denied 
> ("javax.management.MBeanServerPermission" "createMBeanServer")
>[junit4]   2>  at 
> java.base/java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
>[junit4]   2>  at 
> java.base/java.security.AccessController.checkPermission(AccessController.java:897)
>[junit4]   2>  at 
> java.base/java.lang.SecurityManager.checkPermission(SecurityManager.java:322)
>[junit4]   2>  at 
> java.management/java.lang.management.ManagementFactory.getPlatformMBeanServer(ManagementFactory.java:479)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.jmx.Server.reregisterMBeansAfterReconfigure(Server.java:140)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.LoggerContext.setConfiguration(LoggerContext.java:559)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:620)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:637)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:231)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:153)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:45)
>[junit4]   2>  at 
> org.apache.logging.log4j.LogManager.getContext(LogManager.java:194)
>[junit4]   2>  at 
> org.apache.logging.log4j.LogManager.getLogger(LogManager.java:581)
>[junit4]   2>  at 
> org.apache.lucene.luke.util.LoggerFactory.getLogger(LoggerFactory.java:70)
>[junit4]   2>  at 
> org.apache.lucene.luke.models.util.IndexUtils.<clinit>(IndexUtils.java:62)
>[junit4]   2>  at 
> org.apache.lucene.luke.models.LukeModel.<init>(LukeModel.java:60)
>[junit4]   2>  at 
> org.apache.lucene.luke.models.overview.OverviewImpl.<init>(OverviewImpl.java:50)
>[junit4]   2>  at 
> org.apache.lucene.luke.models.overview.OverviewImplTest.testIsOptimized(OverviewImplTest.java:77)
>[junit4]   2>  at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>[junit4]   2>  at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>[junit4]   2>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>[junit4]   2>  at 
> java.base/java.lang.reflect.Method.invoke(Method.java:566)
>[junit4]   2>  at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
>[junit4]   2>  at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
>[junit4]   2>  at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
>[junit4]   2>  at 
> 

[jira] [Commented] (LUCENE-8998) OverviewImplTest.testIsOptimized reproducible failure

2019-10-05 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16945009#comment-16945009
 ] 

Tomoko Uchida commented on LUCENE-8998:
---

I'm not sure this old feature is meaningful with Lucene's current segment 
management strategy, but I attached a patch that uses NoMergePolicy to prevent 
segment merges when testing the {{isOptimized()}} method; see the sketch below.
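
The patch boils down to configuring the test's writer so segments are never merged behind the test's back; {{NoMergePolicy}} is a real Lucene API, though the wiring below is only a sketch of the idea, not the attached patch itself:
{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.store.Directory;

public class NoMergeWriterSketch {

  /** Creates a writer whose flushed segments are never merged, so the segment
   *  count observed by isOptimized() stays deterministic across random seeds. */
  static IndexWriter newWriterWithoutMerges(Directory dir) throws IOException {
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    config.setMergePolicy(NoMergePolicy.INSTANCE);
    return new IndexWriter(dir, config);
  }
}
{code}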

> OverviewImplTest.testIsOptimized reproducible failure
> -
>
> Key: LUCENE-8998
> URL: https://issues.apache.org/jira/browse/LUCENE-8998
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: luke
>Reporter: Chris M. Hostetter
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-8998.patch
>
>
> The following seed reproduces reliably for me on master...
> (NOTE: the {{ERROR StatusLogger}} messages include the one about the 
> AccessControlException occur even with other seeds when the test passes)
> {noformat}
> [mkdir] Created dir: /home/hossman/lucene/alt_dev/lucene/build/luke/test
> [junit4:pickseed] Seed property 'tests.seed' already defined: 9123DD19C50D658
> [mkdir] Created dir: 
> /home/hossman/lucene/alt_dev/lucene/build/luke/test/temp
>[junit4] <JUnit4> says cześć! Master seed: 9123DD19C50D658
>[junit4] Executing 1 suite with 1 JVM.
>[junit4] 
>[junit4] Started J0 PID(8576@localhost).
>[junit4] Suite: org.apache.lucene.luke.models.overview.OverviewImplTest
>[junit4]   2> ERROR StatusLogger No Log4j 2 configuration file found. 
> Using default configuration (logging only errors to the console), or user 
> programmatically provided configurations. Set system property 'log4j2.debug' 
> to show Log4j 2 internal initialization logging. See 
> https://logging.apache.org/log4j/2.x/manual/configuration.html for 
> instructions on how to configure Log4j 2
>[junit4]   2> ERROR StatusLogger Could not reconfigure JMX
>[junit4]   2>  java.security.AccessControlException: access denied 
> ("javax.management.MBeanServerPermission" "createMBeanServer")
>[junit4]   2>  at 
> java.base/java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
>[junit4]   2>  at 
> java.base/java.security.AccessController.checkPermission(AccessController.java:897)
>[junit4]   2>  at 
> java.base/java.lang.SecurityManager.checkPermission(SecurityManager.java:322)
>[junit4]   2>  at 
> java.management/java.lang.management.ManagementFactory.getPlatformMBeanServer(ManagementFactory.java:479)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.jmx.Server.reregisterMBeansAfterReconfigure(Server.java:140)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.LoggerContext.setConfiguration(LoggerContext.java:559)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:620)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:637)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:231)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:153)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:45)
>[junit4]   2>  at 
> org.apache.logging.log4j.LogManager.getContext(LogManager.java:194)
>[junit4]   2>  at 
> org.apache.logging.log4j.LogManager.getLogger(LogManager.java:581)
>[junit4]   2>  at 
> org.apache.lucene.luke.util.LoggerFactory.getLogger(LoggerFactory.java:70)
>[junit4]   2>  at 
> org.apache.lucene.luke.models.util.IndexUtils.<clinit>(IndexUtils.java:62)
>[junit4]   2>  at 
> org.apache.lucene.luke.models.LukeModel.<init>(LukeModel.java:60)
>[junit4]   2>  at 
> org.apache.lucene.luke.models.overview.OverviewImpl.<init>(OverviewImpl.java:50)
>[junit4]   2>  at 
> org.apache.lucene.luke.models.overview.OverviewImplTest.testIsOptimized(OverviewImplTest.java:77)
>[junit4]   2>  at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>[junit4]   2>  at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>[junit4]   2>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>[junit4]   2>  at 
> java.base/java.lang.reflect.Method.invoke(Method.java:566)
>[junit4]   2>  at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
>[junit4]   2>  at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
>[junit4]   2>  at 
> 

[jira] [Updated] (LUCENE-8998) OverviewImplTest.testIsOptimized reproducible failure

2019-10-05 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-8998:
--
Attachment: LUCENE-8998.patch
Status: Open  (was: Open)

> OverviewImplTest.testIsOptimized reproducible failure
> -
>
> Key: LUCENE-8998
> URL: https://issues.apache.org/jira/browse/LUCENE-8998
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: luke
>Reporter: Chris M. Hostetter
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-8998.patch
>
>
> The following seed reproduces reliably for me on master...
> (NOTE: the {{ERROR StatusLogger}} messages include the one about the 
> AccessControlException occur even with other seeds when the test passes)
> {noformat}
> [mkdir] Created dir: /home/hossman/lucene/alt_dev/lucene/build/luke/test
> [junit4:pickseed] Seed property 'tests.seed' already defined: 9123DD19C50D658
> [mkdir] Created dir: 
> /home/hossman/lucene/alt_dev/lucene/build/luke/test/temp
>[junit4] <JUnit4> says cześć! Master seed: 9123DD19C50D658
>[junit4] Executing 1 suite with 1 JVM.
>[junit4] 
>[junit4] Started J0 PID(8576@localhost).
>[junit4] Suite: org.apache.lucene.luke.models.overview.OverviewImplTest
>[junit4]   2> ERROR StatusLogger No Log4j 2 configuration file found. 
> Using default configuration (logging only errors to the console), or user 
> programmatically provided configurations. Set system property 'log4j2.debug' 
> to show Log4j 2 internal initialization logging. See 
> https://logging.apache.org/log4j/2.x/manual/configuration.html for 
> instructions on how to configure Log4j 2
>[junit4]   2> ERROR StatusLogger Could not reconfigure JMX
>[junit4]   2>  java.security.AccessControlException: access denied 
> ("javax.management.MBeanServerPermission" "createMBeanServer")
>[junit4]   2>  at 
> java.base/java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
>[junit4]   2>  at 
> java.base/java.security.AccessController.checkPermission(AccessController.java:897)
>[junit4]   2>  at 
> java.base/java.lang.SecurityManager.checkPermission(SecurityManager.java:322)
>[junit4]   2>  at 
> java.management/java.lang.management.ManagementFactory.getPlatformMBeanServer(ManagementFactory.java:479)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.jmx.Server.reregisterMBeansAfterReconfigure(Server.java:140)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.LoggerContext.setConfiguration(LoggerContext.java:559)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:620)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:637)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:231)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:153)
>[junit4]   2>  at 
> org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:45)
>[junit4]   2>  at 
> org.apache.logging.log4j.LogManager.getContext(LogManager.java:194)
>[junit4]   2>  at 
> org.apache.logging.log4j.LogManager.getLogger(LogManager.java:581)
>[junit4]   2>  at 
> org.apache.lucene.luke.util.LoggerFactory.getLogger(LoggerFactory.java:70)
>[junit4]   2>  at 
> org.apache.lucene.luke.models.util.IndexUtils.<clinit>(IndexUtils.java:62)
>[junit4]   2>  at 
> org.apache.lucene.luke.models.LukeModel.<init>(LukeModel.java:60)
>[junit4]   2>  at 
> org.apache.lucene.luke.models.overview.OverviewImpl.<init>(OverviewImpl.java:50)
>[junit4]   2>  at 
> org.apache.lucene.luke.models.overview.OverviewImplTest.testIsOptimized(OverviewImplTest.java:77)
>[junit4]   2>  at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>[junit4]   2>  at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>[junit4]   2>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>[junit4]   2>  at 
> java.base/java.lang.reflect.Method.invoke(Method.java:566)
>[junit4]   2>  at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
>[junit4]   2>  at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
>[junit4]   2>  at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
>[junit4]   2>  at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
>

[jira] [Resolved] (LUCENE-9000) Cannot resolve classes from org.apache.lucene.core plugin and others

2019-10-05 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida resolved LUCENE-9000.
---
Resolution: Not A Problem

> Cannot resolve classes from org.apache.lucene.core plugin and others
> 
>
> Key: LUCENE-9000
> URL: https://issues.apache.org/jira/browse/LUCENE-9000
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser
>Affects Versions: 7.1
> Environment: DTP consumes the org.apache.lucene.core plugin, and 
> compiling against its classes gives this error.
> The compilation errors are for these imports: import 
> org.apache.lucene.queryParser.ParseException; import 
> org.apache.lucene.queryParser.QueryParser; import 
> org.apache.lucene.search.Searcher;
>  
>  
>  
>Reporter: Rosa Casillas
>Priority: Major
> Fix For: 7.1
>
>
> We consume the *org.apache.lucene.core* plugin in our source code, and we 
> updated the *org.apache.lucene.core* version from *2.9.0* to 
> *7.1.0* *(supported by Photon)*.
> Doing that gives us compilation errors on the statements below:
>  
> _import org.apache.lucene.queryParser.ParseException;_
>  _import org.apache.lucene.queryParser.QueryParser;_
>  _import org.apache.lucene.search.Searcher;_
>  
> Can you please let us know how to resolve these imports?
>  
>  
> We took a look at the plugin contents and noticed that the class is not directly 
> there. We tried the Classic class, but even then we were not able to resolve it. 
> We don't have this issue with the previous version (2.9.0).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9000) Cannot resolve classes from org.apache.lucene.core plugin and others

2019-10-05 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16945017#comment-16945017
 ] 

Tomoko Uchida commented on LUCENE-9000:
---

As far as the class imports go, these lines should work with 7.1.0.
{code:java}
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
{code}
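For illustration, a minimal 7.x-style search sketch using those classes (this 
assumes an already-built index in a {{Directory}} named {{dir}} and a field named 
"body"; the old {{Searcher}} class was removed long ago, so {{IndexSearcher}} is 
used directly):
{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

// "dir" is an assumed, already-populated Directory; ParseException and
// IOException handling is omitted for brevity.
try (DirectoryReader reader = DirectoryReader.open(dir)) {
  IndexSearcher searcher = new IndexSearcher(reader);
  QueryParser parser = new QueryParser("body", new StandardAnalyzer());
  Query query = parser.parse("lucene AND search");
  TopDocs hits = searcher.search(query, 10);
  System.out.println("total hits: " + hits.totalHits);
}
{code}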
 

Please refer to the 7.1.0 Javadocs and post questions to the mailing lists instead 
of opening issues, because this Jira isn't a help desk.
 [https://lucene.apache.org/core/7_1_0/]

[https://lucene.apache.org/core/discussion.html]

 

> Cannot resolve classes from org.apache.lucene.core plugin and others
> 
>
> Key: LUCENE-9000
> URL: https://issues.apache.org/jira/browse/LUCENE-9000
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser
>Affects Versions: 7.1
> Environment: DTP consumes the org.apache.lucene.core plugin, and 
> compiling against its classes gives this error.
> The compilation errors are for these imports: import 
> org.apache.lucene.queryParser.ParseException; import 
> org.apache.lucene.queryParser.QueryParser; import 
> org.apache.lucene.search.Searcher;
>  
>  
>  
>Reporter: Rosa Casillas
>Priority: Major
> Fix For: 7.1
>
>
> We consume the *org.apache.lucene.core* plugin in our source code, and we 
> updated the *org.apache.lucene.core* version from *2.9.0* to 
> *7.1.0* *(supported by Photon)*.
> Doing that gives us compilation errors on the statements below:
>  
> _import org.apache.lucene.queryParser.ParseException;_
>  _import org.apache.lucene.queryParser.QueryParser;_
>  _import org.apache.lucene.search.Searcher;_
>  
> Can you please let us know how to resolve these imports?
>  
>  
> We took a look at the plugin contents and noticed that the class is not directly 
> there. We tried the Classic class, but even then we were not able to resolve it. 
> We don't have this issue with the previous version (2.9.0).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search

2019-10-13 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950620#comment-16950620
 ] 

Tomoko Uchida edited comment on LUCENE-9004 at 10/13/19 11:46 PM:
--

I also really look forward to getting this feature into Lucene!

FWIW, it seems that there are several versions of the HNSW paper, and I 
think this is the latest revision (v4). The pseudo-algorithm parts have been 
refined or evolved from the original version ([~sokolov] introduced it here), 
though I've not yet closely checked the diffs and have no idea whether they have 
significant impacts here.

[https://arxiv.org/abs/1603.09320]

(I will check out / play with the PoC branch and share my findings, if it would 
be useful.)
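
Since the interesting part here is the graph traversal, a toy sketch of the 
single-layer greedy descent that HNSW-style graphs rely on may be a useful 
reference; this is illustrative Java only (the adjacency map, vector array and 
L2 metric are assumptions, not the PoC's API):
{code:java}
import java.util.Map;

// Greedy descent over a pre-built neighbor graph (one layer). Every node id
// is assumed to have an entry in "neighbors" and a row in "vectors".
static int greedySearch(int entryPoint, float[] query,
                        Map<Integer, int[]> neighbors, float[][] vectors) {
  int current = entryPoint;
  float best = l2(vectors[current], query);
  boolean improved = true;
  while (improved) {
    improved = false;
    for (int cand : neighbors.get(current)) {
      float d = l2(vectors[cand], query);
      if (d < best) {      // hop to a closer neighbor
        best = d;
        current = cand;
        improved = true;
      }
    }
  }
  return current;          // a local minimum ~= approximate nearest neighbor
}

static float l2(float[] a, float[] b) {
  float sum = 0;
  for (int i = 0; i < a.length; i++) {
    float diff = a[i] - b[i];
    sum += diff * diff;
  }
  return sum;
}
{code}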


was (Author: tomoko uchida):
I also really look forward to get this feature into Lucene!

FWIW, it seems like that there are several versions of the HSNW paper and I 
think this is the latest revision (v4). Pseudo algorithm parts have been 
refined or evolved from the original version ([~sokolov] introduced here) 
though I've not yet closely checked the diffs and have no idea about they have 
significant impacts here.

https://arxiv.org/abs/1603.09320

(I will check out / play with the PoC branch and share my findings, if it would 
be useful.)

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to lookup through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However 
> graph-per-segment is very natural at search time - we can traverse each 
> segments' graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be  limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: In order to add links bidirectionally we must continually update existing 
> documents.
> I think we want 

[jira] [Commented] (LUCENE-9004) Approximate nearest vector search

2019-10-13 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950620#comment-16950620
 ] 

Tomoko Uchida commented on LUCENE-9004:
---

I also really look forward to getting this feature into Lucene!

FWIW, it seems that there are several versions of the HNSW paper, and I 
think this is the latest revision (v4). The pseudo-algorithm parts have been 
refined or evolved from the original version ([~sokolov] introduced it here), 
though I've not yet closely checked the diffs and have no idea whether they have 
significant impacts here.

https://arxiv.org/abs/1603.09320

(I will check out / play with the PoC branch and share my findings, if it would 
be useful.)

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to lookup through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However 
> graph-per-segment is very natural at search time - we can traverse each 
> segments' graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be  limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: In order to add links bidirectionally we must continually update existing 
> documents.
> I think we want to express this API to users as a single joint 
> {{KnnGraphField}} abstraction that joins together the vectors and the graph 
> as a single joint field type. Mostly it just looks like a vector-valued 
> field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many 
> nocommits, basic design is not really set, there is no Query implementation 
> and no integration iwth IndexSearcher, but it does work by some measure using 
> a standalone test class. I've tested with uniform random vectors and on my 
> laptop indexed 10K documents in around 10 

[jira] [Comment Edited] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-10 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034121#comment-17034121
 ] 

Tomoko Uchida edited comment on LUCENE-9201 at 2/11/20 4:43 AM:


One small thing about the equivalent of "ant documentation" (the gradle built-in 
Javadoc task or our customized one):

I think it'd be better if the javadoc generation task output all javadocs to a 
module-wide common directory (e.g., {{lucene/build/docs}} or 
{{solr/build/docs}}) just like the ant build does, instead of to each module's build 
directory. This makes things easier for the subsequent "broken links check" (running 
{{checkJavadocLinks.py}} - or its replacement?) and for release managers' work, which 
includes updating the official documentation site 
([https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo#ReleaseTodo-Pushdocs,changesandjavadocstotheCMSproductiontree]).


was (Author: tomoko uchida):
One small thing about the equivalent "ant documentation" (gradle built-in 
Javadoc task or our customized one), 

I think it'd be better the javadoc generation task should output all javadocs 
to module-wide common directory (e.g., {{lucene/build/docs}} or 
{{solr/build/docs}}) just ant build does, instead of each module's build 
directory. This makes things easy for succeeding "broken links check" (running 
{{checkJavadocLinks.py}} - or its replacement?) and release managers work that 
should includes updating the official documentation site 
([https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo#ReleaseTodo-Pushdocs,changesandjavadocstotheCMSproductiontree]).

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-23 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-9201:
--
Attachment: LUCENE-9201-missing-docs.patch

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9201-ecj-2.patch, LUCENE-9201-ecj.patch, 
> LUCENE-9201-missing-docs.patch, LUCENE-9201.patch, javadocGRADLE.png, 
> javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-23 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042976#comment-17042976
 ] 

Tomoko Uchida commented on LUCENE-9201:
---

I attached [^LUCENE-9201-missing-docs.patch]; the only difference from 
[^LUCENE-9201.patch] is this line.
{code:java}
< +dirs.each { dir ->
---
> +dirs.findAll { it.exists() }.each { dir ->
{code}
(I'm not so familiar with groovy, but collect() seems to be a mapper or 
iterator rather than a filter. 
[http://docs.groovy-lang.org/next/html/documentation/working-with-collections.html#_iterating_on_a_list])
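For readers more at home in Java than groovy, a rough stream analogy of the two 
methods (the variable names are just for illustration):
{code:java}
import java.util.List;
import java.util.stream.Collectors;

List<String> dirs = List.of("core", "luke", "missing");

// groovy findAll { ... } keeps only the matching elements, like Stream.filter:
List<String> existing =
    dirs.stream().filter(d -> !d.equals("missing")).collect(Collectors.toList());

// groovy collect { ... } transforms every element, like Stream.map:
List<Integer> lengths =
    dirs.stream().map(String::length).collect(Collectors.toList());
{code}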
{quote}The task as it is now doesn't really pass for me (python script fails 
for certain subprojects)? This is expected, right?
{quote}
Yes, this is expected. Some projects (lucene/backward-codecs, 
lucene/queryparser, etc.) have legacy package.html files in their source, but the 
Gradle Javadoc task ignores them, so the generated Javadocs lack the package 
summaries. The python linter detects that and correctly fails for now. This is 
the reason why I disabled this task for precommit: 
[https://github.com/apache/lucene-solr/pull/1267/files#diff-5a33a39a6ec8b4facbd4db4cdfb4131a].
 We have to fix the Javadoc task (another issue may be needed?) to make the linter 
happy. I think it is fine to have the "missing docs check" task on 
master soon, to track how the javadoc task is broken or fixed.

I'd like to commit it to master tomorrow if there is no disapproval, since 
it's getting a bit late in JST...

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9201-ecj-2.patch, LUCENE-9201-ecj.patch, 
> LUCENE-9201-missing-docs.patch, LUCENE-9201.patch, javadocGRADLE.png, 
> javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9242) Gradle Javadoc task should output the same documents as Ant

2020-02-25 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-9242:
--
Description: 
"javadoc" task for the Gradle build does not correctly output package 
summaries, since it ignores "package.html" file in the source tree (so the 
Python linter {{checkJavaDocs.py}} detects that and fails for now.)

Also the "javadoc" task should make inter-module links just as Ant build does.

See for more details: LUCENE-9201

  was:
"javadoc" task for the Gradle build does not correctly output package 
summaries, since it ignores "package.html" file in the source tree (so the 
Python linter {{checkJavaDocs.py}} detects that and fails for now.)

See for more details: LUCENE-9201


> Gradle Javadoc task should output the same documents as Ant
> ---
>
> Key: LUCENE-9242
> URL: https://issues.apache.org/jira/browse/LUCENE-9242
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: general/javadocs
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Priority: Major
>
> "javadoc" task for the Gradle build does not correctly output package 
> summaries, since it ignores "package.html" file in the source tree (so the 
> Python linter {{checkJavaDocs.py}} detects that and fails for now.)
> Also the "javadoc" task should make inter-module links just as Ant build does.
> See for more details: LUCENE-9201



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9242) Gradle Javadoc task should output the same documents as Ant

2020-02-25 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-9242:
--
Summary: Gradle Javadoc task should output the same documents as Ant  (was: 
Gradle Javadoc task does not include package summaries)

> Gradle Javadoc task should output the same documents as Ant
> ---
>
> Key: LUCENE-9242
> URL: https://issues.apache.org/jira/browse/LUCENE-9242
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: general/javadocs
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Priority: Major
>
> "javadoc" task for the Gradle build does not correctly output package 
> summaries, since it ignores "package.html" file in the source tree (so the 
> Python linter {{checkJavaDocs.py}} detects that and fails for now.)
> See for more details: LUCENE-9201



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-25 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044631#comment-17044631
 ] 

Tomoko Uchida commented on LUCENE-9201:
---

I finally got how the "invoke-module-javadoc" macro resolves inter-module links 
(which isn't yet covered by the current gradle build).
 I changed the subject of LUCENE-9242 to "Gradle Javadoc task should output 
the same documents as Ant".

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9201-ecj-2.patch, LUCENE-9201-ecj.patch, 
> LUCENE-9201-missing-docs.patch, LUCENE-9201.patch, javadocGRADLE.png, 
> javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (SOLR-14295) Add the parameter descriptionn about "discardCompoundToken" for JapaneseTokenizer

2020-02-29 Thread Tomoko Uchida (Jira)
Tomoko Uchida created SOLR-14295:


 Summary: Add the parameter descriptionn about 
"discardCompoundToken" for JapaneseTokenizer
 Key: SOLR-14295
 URL: https://issues.apache.org/jira/browse/SOLR-14295
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: documentation
Reporter: Tomoko Uchida
Assignee: Tomoko Uchida


In [LUCENE-9123], a parameter {{discardCompoundToken}} was added to 
JapaneseTokenizer(Factory).

The ref-guide needs to be updated to let Solr users know this change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-11746) numeric fields need better error handling for prefix/wildcard syntax -- consider uniform support for "foo:* == foo:[* TO *]"

2020-02-29 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048257#comment-17048257
 ] 

Tomoko Uchida commented on SOLR-11746:
--

It seems like the Ref Guide build is now failing due to the changes here.
{code:java}
solr-ref-guide $ ant build-site
...
build-site:
 [java] Relative link points at id that doesn't exist in dest: 
#differences-between-lucenes-classic-query-parser-and-solrs-standard-query-parser
 [java]  ... source: 
file:/mnt/hdd/repo/lucene-solr/solr/build/solr-ref-guide/html-site/the-standard-query-parser.html
 [java] Relative link points at id that doesn't exist in dest: 
the-standard-query-parser.html#differences-between-lucenes-classic-query-parser-and-solrs-standard-query-parser
 [java]  ... source: 
file:/mnt/hdd/repo/lucene-solr/solr/build/solr-ref-guide/html-site/common-query-parameters.html
 [java] Processed 2611 links (1932 relative) to 3477 anchors in 262 files
 [java] Total of 2 problems found

BUILD FAILED
/mnt/hdd/repo/lucene-solr/solr/solr-ref-guide/build.xml:251: Java returned: 255
{code}
The build worked for me once I removed these two lines.
{code:java}
--- a/solr/solr-ref-guide/src/common-query-parameters.adoc
+++ b/solr/solr-ref-guide/src/common-query-parameters.adoc
@@ -102,7 +102,7 @@ fq=+popularity:[10 TO *] +section:0
 
 
 * The document sets from each filter query are cached independently. Thus, 
concerning the previous examples: use a single `fq` containing two mandatory 
clauses if those clauses appear together often, and use two separate `fq` 
parameters if they are relatively independent. (To learn about tuning cache 
sizes and making sure a filter cache actually exists, see 
<>.)
-* It is also possible to use 
<> inside the `fq` to cache clauses individually and - among other 
things - to achieve union of cached filter queries.
+// * It is also possible to use 
<> inside the `fq` to cache clauses individually and - among other 
things - to achieve union of cached filter queries.

diff --git a/solr/solr-ref-guide/src/the-standard-query-parser.adoc 
b/solr/solr-ref-guide/src/the-standard-query-parser.adoc
index c572e503e5b..3a3cd7f958d 100644
--- a/solr/solr-ref-guide/src/the-standard-query-parser.adoc
+++ b/solr/solr-ref-guide/src/the-standard-query-parser.adoc
@@ -174,7 +174,7 @@ The brackets around a query determine its inclusiveness.
 * You can mix these types so one end of the range is inclusive and the other 
is exclusive. Here's an example: `count:{1 TO 10]`
 
 Wildcards, `*`, can also be used for either or both endpoints to specify an 
open-ended range query.
-This is a 
<<#differences-between-lucenes-classic-query-parser-and-solrs-standard-query-parser,divergence
 from Lucene's Classic Query Parser>>.
+// This is a 
<<#differences-between-lucenes-classic-query-parser-and-solrs-standard-query-parser,divergence
 from Lucene's Classic Query Parser>>.
{code}
I know nothing about this issue; I just noticed the broken links when I updated 
the ref-guide for another issue...

> numeric fields need better error handling for prefix/wildcard syntax -- 
> consider uniform support for "foo:* == foo:[* TO *]"
> 
>
> Key: SOLR-11746
> URL: https://issues.apache.org/jira/browse/SOLR-11746
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.0
>Reporter: Chris M. Hostetter
>Assignee: Houston Putman
>Priority: Major
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, 
> SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, 
> SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch
>
>
> On the solr-user mailing list, Torsten Krah pointed out that with Trie 
> numeric fields, query syntax such as {{foo_d:\*}} has been functionality 
> equivilent to {{foo_d:\[\* TO \*]}} and asked why this was not also supported 
> for Point based numeric fields.
> The fact that this type of syntax works (for {{indexed="true"}} Trie fields) 
> appears to have been an (untested, undocumented) fluke of Trie fields given 
> that they use indexed terms for the (encoded) numeric terms and inherit the 
> default implementation of {{FieldType.getPrefixQuery}} which produces a 
> prefix query against the {{""}} (empty string) term.  
> (Note that this syntax has aparently _*never*_ worked for Trie fields with 
> {{indexed="false" docValues="true"}} )
> In general, we should assess the behavior users attempt a prefix/wildcard 
> syntax query against numeric fields, as currently the behavior is largely 
> non-sensical:  prefix/wildcard syntax frequently match no docs w/o any sort 
> of error, and the aformentioned {{numeric_field:*}} behaves inconsistently 

[jira] [Updated] (SOLR-14295) Add the parameter description about "discardCompoundToken" for JapaneseTokenizer

2020-02-29 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated SOLR-14295:
-
Summary: Add the parameter description about "discardCompoundToken" for 
JapaneseTokenizer  (was: Add the parameter descriptionn about 
"discardCompoundToken" for JapaneseTokenizer)

> Add the parameter description about "discardCompoundToken" for 
> JapaneseTokenizer
> 
>
> Key: SOLR-14295
> URL: https://issues.apache.org/jira/browse/SOLR-14295
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Minor
> Attachments: SOLR-14295.patch
>
>
> In [LUCENE-9123], a parameter {{discardCompoundToken}} was added to 
> JapaneseTokenizer(Factory).
> The ref-guide needs to be updated to let Solr users know this change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-14295) Add the parameter description about "discardCompoundToken" for JapaneseTokenizer

2020-02-29 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated SOLR-14295:
-
Description: 
In [LUCENE-9123], a parameter {{discardCompoundToken}} was added to 
JapaneseTokenizer(Factory).

The ref-guide needs to be updated to let Solr users know about this change.

  was:
In [LUCENE-9123], a parameter {{discardCompoundToken}} was added to 
JapaneseTokenizer(Factory).

The ref-guide needs to be updated to let Solr users know this change.


> Add the parameter description about "discardCompoundToken" for 
> JapaneseTokenizer
> 
>
> Key: SOLR-14295
> URL: https://issues.apache.org/jira/browse/SOLR-14295
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Minor
> Attachments: SOLR-14295.patch
>
>
> In [LUCENE-9123], a parameter {{discardCompoundToken}} was added to 
> JapaneseTokenizer(Factory).
> The ref-guide needs to be updated to let Solr users know about this change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (SOLR-14295) Add the parameter description about "discardCompoundToken" for JapaneseTokenizer

2020-02-29 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida resolved SOLR-14295.
--
Fix Version/s: 8.5
   master (9.0)
   Resolution: Fixed

> Add the parameter description about "discardCompoundToken" for 
> JapaneseTokenizer
> 
>
> Key: SOLR-14295
> URL: https://issues.apache.org/jira/browse/SOLR-14295
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Minor
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14295.patch
>
>
> In [LUCENE-9123], a parameter {{discardCompoundToken}} was added to 
> JapaneseTokenizer(Factory).
> The ref-guide needs to be updated to let Solr users know about this change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-14295) Add the parameter descriptionn about "discardCompoundToken" for JapaneseTokenizer

2020-02-29 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated SOLR-14295:
-
Attachment: SOLR-14295.patch

> Add the parameter descriptionn about "discardCompoundToken" for 
> JapaneseTokenizer
> -
>
> Key: SOLR-14295
> URL: https://issues.apache.org/jira/browse/SOLR-14295
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Minor
> Attachments: SOLR-14295.patch
>
>
> In [LUCENE-9123], a parameter {{discardCompoundToken}} was added to 
> JapaneseTokenizer(Factory).
> The ref-guide needs to be updated to let Solr users know this change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-9242) Gradle Javadoc task should output the same documents as Ant

2020-03-01 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048696#comment-17048696
 ] 

Tomoko Uchida edited comment on LUCENE-9242 at 3/1/20 10:37 PM:


I opened a draft PR [https://github.com/apache/lucene-solr/pull/1304]. This 
adds a gradle task named {{invokeJavadoc}}, which generates Javadocs with 
inter-module hyperlinks by invoking the Ant javadoc task. It also passes the 
{{checkMissingJavadocs}} check.

The task can be called as follows:
{code:java}
# generate javadocs for each project
$ ./gradlew :lucene:core:invokeJavadoc
{code}
or,
{code:java}
# generate javadocs for all projects at once
$ ./gradlew invokeJavadoc
{code}
The work isn't complete yet, but the most important parts are already ported.

 

Quick replies to the comments on LUCENE-9201 follow:
{quote}It is my personal preference to have a project-scope granularity. This 
way you can run project-scoped task (like gradlew -p lucene/core javadoc). My 
personal take on assembling "distributions" is to have a separate project that 
just takes what it needs from other projects and puts it together (with any 
tweaks required). This makes it easier to reason about how a distribution is 
assembled and from where, while each project just takes care of itself.
{quote}
I'd love this approach. However, while trying it I noticed that it looks 
difficult to properly generate inter-module hyperlinks without affecting the 
existing javadocs' path hierarchy (already published on the apache.org web 
site), if we want to place generated javadocs under 
${sub_project_root}/build/docs/javadoc (gradle's default javadoc destination). 
The fundamental problem here, I think, is that in order to make hyperlinks from 
module A to another module B, we need to know the effective relative path from 
module A to module B and pass it to the Javadoc tool.

I aggregated all javadocs into {{lucene/build/docs}} or {{solr/build/docs}}, 
just as the Ant build does, to resolve the relative paths. I might be missing 
something - please let me know if my understanding isn't correct.
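
To make the relative-path point concrete, a small sketch (the module paths are 
only examples):
{code:java}
import java.nio.file.Path;

// Linking module A's javadocs to module B's requires the relative path
// between their output directories (e.g., to feed into -linkoffline).
Path a = Path.of("lucene/build/docs/analyzers-common");
Path b = Path.of("lucene/build/docs/core");
System.out.println(a.relativize(b));  // prints: ../core
{code}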
{quote}for "directly call the javadoc tool" we may want to use the ant task as 
a start. This ant task is doing quite a bit of work above and beyond what the 
tool is doing (if you look at the relevant code to ant, you may be shocked!).
{quote}
As the first step I tried to reproduce the principal Ant macros, 
"invoke-javadoc" (in lucene/common-build.xml) and "invoke-module-javadoc" (in 
lucene/module-build.xml), in the gradle build. By doing so, there are now no missing 
package summaries, and inter-module links are generated. (The current setup to 
resolve the hyperlinks looks quite redundant; I think we can do it in more 
sophisticated ways.)
{quote}A custom javadoc invocation is certainly possible and could possibly 
make things easier in the long run.
{quote}
{quote}as a second step you can look at computing package list for a module 
yourself (it may allow invoking the tool directly).
{quote}
Yes, we will probably be able to throw away all ant tasks and rely only on pure 
gradle code. Some extra effort will be needed to faithfully transfer the 
elaborate ant setups into corresponding gradle scripts...
{quote}You'd need to declare inputs/ outputs properly though so that it is 
skippable. Those javadoc invocations take a long time in precommit.
{quote}
I declared inputs/outputs on the task so that the javadoc invocation is not 
needlessly repeated. It seems to work - {{ant.javadoc}} is called only when the 
java sources or the output directory change.


was (Author: tomoko uchida):
I opened a draft PR [https://github.com/apache/lucene-solr/pull/1304]. This 
adds a gradle task, named {{invokeJavadoc}}, which generates Javadocs with 
inter-module hyperlinks by invoking Ant javadoc task. Also this passes 
{{checkMissingJavadocs}} check.

The task can be called as below:
{code:java}
# generate javadocs for each project
$ ./gradlew :lucene:core:invokeJavadoc
{code}
or,
{code:java}
# generate javadocs for all projects at once
$ ./gradlew invokeJavadoc
{code}
The work isn't completed yet, but the most important parts are already ported.

 

Quick replies to comments on LUCENE-9201 will be following:
{quote}It is my personal preference to have a project-scope granularity. This 
way you can run project-scoped task (like gradlew -p lucene/core javadoc). My 
personal take on assembling "distributions" is to have a separate project that 
just takes what it needs from other projects and puts it together (with any 
tweaks required). This makes it easier to reason about how a distribution is 
assembled and from where, while each project just takes care of itself.
{quote}
I'd love this approach, however, when I was trying I noticed that it looks 
difficult to properly generate inter-module hyperlinks without affecting the 
existing javadoc's path hierarchy (already published on the apache.org web 

[jira] [Comment Edited] (LUCENE-9242) Gradle Javadoc task should output the same documents as Ant

2020-03-01 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048696#comment-17048696
 ] 

Tomoko Uchida edited comment on LUCENE-9242 at 3/1/20 10:38 PM:


I opened a draft PR [https://github.com/apache/lucene-solr/pull/1304]. This 
adds a gradle task named {{invokeJavadoc}}, which generates Javadocs with 
inter-module hyperlinks by invoking the Ant javadoc task. It also passes the 
{{checkMissingJavadocs}} check.

The task can be called as follows:
{code:java}
# generate javadocs for each project
$ ./gradlew :lucene:core:invokeJavadoc
{code}
or,
{code:java}
# generate javadocs for all projects at once
$ ./gradlew invokeJavadoc
{code}
The work isn't complete yet, but the most important parts are already ported.

 

Quick replies to the comments on LUCENE-9201 follow:
{quote}It is my personal preference to have a project-scope granularity. This 
way you can run project-scoped task (like gradlew -p lucene/core javadoc). My 
personal take on assembling "distributions" is to have a separate project that 
just takes what it needs from other projects and puts it together (with any 
tweaks required). This makes it easier to reason about how a distribution is 
assembled and from where, while each project just takes care of itself.
{quote}
I'd love this approach. However, while trying it I noticed that it looks 
difficult to properly generate inter-module hyperlinks without affecting the 
existing javadocs' path hierarchy (already published on the apache.org web 
site), if we want to place generated javadocs under 
${sub_project_root}/build/docs/javadoc (gradle's default javadoc destination). 
The fundamental problem here, I think, is that in order to make hyperlinks from 
module A to another module B, we need to know the effective relative path from 
module A to module B and pass it to the Javadoc tool.

I aggregated all javadocs into {{lucene/build/docs}} or {{solr/build/docs}}, 
just as the Ant build does, to resolve the relative paths. I might be missing 
something - please let me know if my understanding isn't correct.

 
{quote}for "directly call the javadoc tool" we may want to use the ant task as 
a start. This ant task is doing quite a bit of work above and beyond what the 
tool is doing (if you look at the relevant code to ant, you may be shocked!).
{quote}
As the first step I tried to reproduce the principal Ant macros, 
"invoke-javadoc" (in lucene/common-build.xml) and "invoke-module-javadoc" (in 
lucene/module-build.xml), in the gradle build. By doing so, there are now no missing 
package summaries, and inter-module links are generated. (The current setup to 
resolve the hyperlinks looks quite redundant; I think we can do it in more 
sophisticated ways.)

 
{quote}A custom javadoc invocation is certainly possible and could possibly 
make things easier in the long run.
{quote}
{quote}as a second step you can look at computing package list for a module 
yourself (it may allow invoking the tool directly).
{quote}
Yes, we will probably be able to throw away all ant tasks and rely only on pure 
gradle code. Some extra effort will be needed to faithfully transfer the 
elaborate ant setups into corresponding gradle scripts...

 
{quote}You'd need to declare inputs/ outputs properly though so that it is 
skippable. Those javadoc invocations take a long time in precommit.
{quote}
I declared inputs/outputs on the task so that the javadoc invocation is not 
needlessly repeated. It seems to work - {{ant.javadoc}} is called only when the 
java sources or the output directory change.


was (Author: tomoko uchida):
I opened a draft PR [https://github.com/apache/lucene-solr/pull/1304]. This 
adds a gradle task, named {{invokeJavadoc}}, which generates Javadocs with 
inter-module hyperlinks by invoking Ant javadoc task. Also this passes 
{{checkMissingJavadocs}} check.

The task can be called as below:
{code:java}
# generate javadocs for each project
$ ./gradlew :lucene:core:invokeJavadoc
{code}
or,
{code:java}
# generate javadocs for all projects at once
$ ./gradlew invokeJavadoc
{code}
The work isn't completed yet, but the most important parts are already ported.

 

Quick replies to comments on LUCENE-9201 will be following:
{quote}It is my personal preference to have a project-scope granularity. This 
way you can run project-scoped task (like gradlew -p lucene/core javadoc). My 
personal take on assembling "distributions" is to have a separate project that 
just takes what it needs from other projects and puts it together (with any 
tweaks required). This makes it easier to reason about how a distribution is 
assembled and from where, while each project just takes care of itself.
{quote}
I'd love this approach, however, when I was trying I noticed that it looks 
difficult to properly generate inter-module hyperlinks without affecting the 
existing javadoc's path hierarchy (already published on the apache.org 

[jira] [Commented] (LUCENE-9242) Gradle Javadoc task should output the same documents as Ant

2020-03-01 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048696#comment-17048696
 ] 

Tomoko Uchida commented on LUCENE-9242:
---

I opened a draft PR [https://github.com/apache/lucene-solr/pull/1304]. This 
adds a gradle task named {{invokeJavadoc}}, which generates Javadocs with 
inter-module hyperlinks by invoking the Ant javadoc task. It also passes the 
{{checkMissingJavadocs}} check.

The task can be called as follows:
{code:java}
# generate javadocs for each project
$ ./gradlew :lucene:core:invokeJavadoc
{code}
or,
{code:java}
# generate javadocs for all projects at once
$ ./gradlew invokeJavadoc
{code}
The work isn't complete yet, but the most important parts are already ported.

 

Quick replies to the comments on LUCENE-9201 follow:
{quote}It is my personal preference to have a project-scope granularity. This 
way you can run project-scoped task (like gradlew -p lucene/core javadoc). My 
personal take on assembling "distributions" is to have a separate project that 
just takes what it needs from other projects and puts it together (with any 
tweaks required). This makes it easier to reason about how a distribution is 
assembled and from where, while each project just takes care of itself.
{quote}
I'd love this approach. However, while trying it I noticed that it looks 
difficult to properly generate inter-module hyperlinks without affecting the 
existing javadocs' path hierarchy (already published on the apache.org web 
site), if we want to place generated javadocs under 
${sub_project_root}/build/docs/javadoc (gradle's default javadoc destination). 
The fundamental problem here, I think, is that in order to make hyperlinks from 
module A to another module B, we need to know the effective relative path from 
module A to module B and pass it to the Javadoc tool.

I aggregated all javadocs into {{lucene/build/docs}} or {{solr/build/docs}}, 
just as the Ant build does, to resolve the relative paths. I might be missing 
something - please let me know if my understanding isn't correct.
{quote}for "directly call the javadoc tool" we may want to use the ant task as 
a start. This ant task is doing quite a bit of work above and beyond what the 
tool is doing (if you look at the relevant code to ant, you may be shocked!).
{quote}
As the first step I tried to reproduce the principal Ant macros, 
"invoke-javadoc" (in lucene/common-build.xml) and "invoke-module-javadoc" (in 
lucene/module-build.xml), in the gradle build. By doing so, there are now no missing 
package summaries, and inter-module links are generated. (The current setup to 
resolve the hyperlinks looks quite redundant; I think we can do it in more 
sophisticated ways.)
{quote}A custom javadoc invocation is certainly possible and could possibly 
make things easier in the long run.
{quote}
{quote}as a second step you can look at computing package list for a module 
yourself (it may allow invoking the tool directly).
{quote}
Yes, we will probably be able to throw away all ant tasks and rely only on pure 
gradle code. Some extra effort will be needed to faithfully transfer the 
elaborate ant setups into corresponding gradle scripts...
{quote}You'd need to declare inputs/ outputs properly though so that it is 
skippable. Those javadoc invocations take a long time in precommit.
{quote}
I declared inputs/outputs on the task so that the javadoc invocation is not 
needlessly repeated. It seems to work - {{ant.javadoc}} is called only when the 
java sources or the output directory change.

> Gradle Javadoc task should output the same documents as Ant
> ---
>
> Key: LUCENE-9242
> URL: https://issues.apache.org/jira/browse/LUCENE-9242
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: general/javadocs
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> "javadoc" task for the Gradle build does not correctly output package 
> summaries, since it ignores "package.html" file in the source tree (so the 
> Python linter {{checkJavaDocs.py}} detects that and fails for now.)
> Also the "javadoc" task should make inter-module links just as Ant build does.
> See for more details: LUCENE-9201



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9242) Gradle Javadoc task should output the same documents as Ant

2020-03-01 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048702#comment-17048702
 ] 

Tomoko Uchida commented on LUCENE-9242:
---

[~dweiss] [~rcmuir] Could you take a look at the PR? It seems to work for me, 
but I am not sure whether this is a good start; any thoughts or brief 
comments are welcome.

> Gradle Javadoc task should output the same documents as Ant
> ---
>
> Key: LUCENE-9242
> URL: https://issues.apache.org/jira/browse/LUCENE-9242
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: general/javadocs
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> "javadoc" task for the Gradle build does not correctly output package 
> summaries, since it ignores "package.html" file in the source tree (so the 
> Python linter {{checkJavaDocs.py}} detects that and fails for now.)
> Also the "javadoc" task should make inter-module links just as Ant build does.
> See for more details: LUCENE-9201



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-21 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020273#comment-17020273
 ] 

Tomoko Uchida commented on LUCENE-9123:
---

Thank you [~cm] and [~johtani] for your comments.

Fixing the synonym filter would not be easy and may take time, so I 
think the patch is a good quick fix for the majority of users. And yes, we 
could recommend the UniDic dictionary instead of search mode once we resolve 
https://issues.apache.org/jira/browse/LUCENE-4056 and 
https://issues.apache.org/jira/browse/LUCENE-8816.

About n-best mode, I am not sure we should consider it here. "Emitting 
n-best tokens" and "emitting compound tokens in search mode" are different 
concepts, though both emit multiple tokens at the same position. As far as I 
read the code, they are orthogonal. (Maybe we should open another issue for 
n-best, but it seems difficult to find a solution for that without fixing the 
synonym filter.)
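
For anyone reproducing the problem, here is a small sketch that tokenizes the 
term directly (the boolean discardCompoundToken argument assumes the constructor 
added by the patch here; with search mode and the compound kept, one of the 
emitted tokens has a position increment of 0, which is exactly what the synonym 
parser rejects):
{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Search mode splits 株式会社 into 株式 + 会社 and also emits the compound
// when discardCompoundToken=false, so tokens overlap at the same position.
JapaneseTokenizer tok =
    new JapaneseTokenizer(null, true, /* discardCompoundToken= */ false, Mode.SEARCH);
tok.setReader(new StringReader("株式会社"));
CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
PositionIncrementAttribute posInc = tok.addAttribute(PositionIncrementAttribute.class);
tok.reset();
while (tok.incrementToken()) {
  System.out.println(term + " (posInc=" + posInc.getPositionIncrement() + ")");
}
tok.end();
tok.close();
{code}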

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.4
>Reporter: Kazuaki Hiraga
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple 
> tokens as output. If we use `mode=normal`, it is fine. However, we would 
> like to use decomposed tokens to maximize the chance of increasing 
> recall.
> Snippet of schema:
> {code:xml}
> <fieldType name="text_ja" class="solr.TextField"
>  positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="lang/synonyms_ja.txt"
>             tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>     <filter class="solr.JapaneseBaseFormFilterFactory"/>
>     <filter class="solr.JapanesePartOfSpeechStopFilterFactory"
>             tags="lang/stoptags_ja.txt" />
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
>     <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> An synonym entry that generates error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is an output on console:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-03 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17050924#comment-17050924
 ] 

Tomoko Uchida commented on LUCENE-9136:
---

[~jtibshirani] [~irvingzhang] thanks for your hard work here!
{quote}I was thinking we could actually reuse the existing `PostingsFormat` and 
`DocValuesFormat` implementations.
{quote}
Actually the first implementation (by Michael Sokolov) for HNSW 
wrapped DocValuesFormat to avoid code duplication. However, this approach - 
reusing existing code - could raise another concern from the perspective of 
maintenance. (From the beginning, Adrien Grand suggested a dedicated format 
instead of hacking doc values.) This is the main reason why I introduced a new 
format for knn search in LUCENE-9004.

I'm not strongly against the "reusing existing format" strategy if it's the 
best way here; I just want to share my feeling that it could be a bit 
controversial, and you might need to convince maintainers that the (pretty new) 
feature does not cause any problems/concerns for future maintenance of Lucene 
core if you implement it on the existing formats/readers.

I have not closely looked at your PR yet - sorry if my comments are completely 
beside the point (you might already have talked with other committers about the 
implementation in another channel, e.g. private chats?).

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: 1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png, 
> image-2020-02-16-15-05-02-451.png
>
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plan to support a Java interface, which makes them 
> hard to integrate into Java projects and unsuitable for those who are not 
> familiar with C/C++ [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories:
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat can be gradually 
> increased by adjusting the query parameter (nprobe), while it is hard for 
> HNSW to improve its accuracy*. In theory, IVFFlat can achieve a 100% recall 
> ratio (see the sketch after this description). 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene has made great progress. The issue draws the 
> attention of those who are interested in Lucene or hope to use HNSW with 
> Solr/Lucene. As an alternative for solving ANN similarity search problems, 
> IVFFlat is also very popular with many users and supporters. Compared with 
> HNSW, IVFFlat has a smaller index size but requires k-means clustering, while 
> HNSW is faster in query (no training required) but requires extra storage for 
> saving graphs 
> [indexing 1M vectors|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors].
>  Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both 
> algorithms have their merits and demerits. Since HNSW is now under 
> development, it may be better to provide both implementations (HNSW && 
> IVFFlat) for potential users who face very different scenarios and want 
> more choices.
> The latest branch is 
> [lucene-9136-ann-ivfflat|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat].
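To make the nprobe/recall trade-off above concrete, here is a toy, 
self-contained sketch of the IVFFlat query path (plain Java with hypothetical 
names; not the code on the branch):

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy IVFFlat query path: probe only the nprobe nearest clusters. */
class IvfFlatSketch {

  static float l2(float[] a, float[] b) {
    float s = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      s += d * d;
    }
    return s;
  }

  /**
   * centroids: k-means cluster centers computed at index time (the "training" step);
   * clusters.get(c): vectors assigned to centroid c (the inverted file).
   * Raising nprobe scans more clusters, so recall goes up and queries get slower;
   * nprobe == centroids.length degenerates to exact brute-force search (100% recall).
   */
  static List<float[]> search(float[] query, float[][] centroids,
                              List<List<float[]>> clusters, int nprobe, int topK) {
    Integer[] order = new Integer[centroids.length];
    for (int c = 0; c < order.length; c++) order[c] = c;
    Arrays.sort(order, Comparator.comparingDouble(c -> (double) l2(query, centroids[c])));

    List<float[]> candidates = new ArrayList<>();
    for (int i = 0; i < Math.min(nprobe, order.length); i++) {
      // "flat" = exhaustive scan inside each probed cluster, no compression
      candidates.addAll(clusters.get(order[i]));
    }
    candidates.sort(Comparator.comparingDouble(v -> (double) l2(query, v)));
    return candidates.subList(0, Math.min(topK, candidates.size()));
  }
}
{code}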



--

[jira] [Commented] (LUCENE-9259) NGramFilter use wrong argument name for preserve option

2020-03-04 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051311#comment-17051311
 ] 

Tomoko Uchida commented on LUCENE-9259:
---

The added tests and ref-guide examples also look fine to me. I'm planning to 
commit the patch to master and branch_8x shortly.

[~Paul Pazderski] Can I ask your email address for crediting?

 

> NGramFilter use wrong argument name for preserve option
> ---
>
> Key: LUCENE-9259
> URL: https://issues.apache.org/jira/browse/LUCENE-9259
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 7.4, 8.0
>Reporter: Paul Pazderski
>Priority: Minor
> Attachments: LUCENE-9259.patch
>
>
> LUCENE-7960 added the possibility to preserve the original term when using 
> NGram filters. The documentation says to enable it with 'preserveOriginal', 
> and that works for the EdgeNGram filter. But the NGram filter requires the 
> initially planned option 'keepShortTerms' to enable this feature.
> This inconsistency is confusing. I'll provide a patch with a possible fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-9242) Gradle Javadoc task should output the same documents as Ant

2020-03-02 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049160#comment-17049160
 ] 

Tomoko Uchida edited comment on LUCENE-9242 at 3/2/20 1:47 PM:
---

My description of the inter-module links was not good... let me share an 
example to clarify the problem I am bumping into (though I'm not sure such a 
redundant explanation is needed here).

There is a link from {{o.a.l.a.core.KeywordAnalyzer}} to {{o.a.l.a.Analyzer}}. 
{{KeywordAnalyzer}} is placed in the "analysis/common" project and {{Analyzer}} 
is placed in the "core" project, so the link is an inter-module (inter-project) 
link from "analysis/common" to "core".

[https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html]

The link is represented by a relative path: {{}}, which is automatically 
generated by the Ant javadoc task (its nested {{link}} element). The link 
element (without the "offline" option) raises errors when the "href" path or 
URL cannot be resolved, so you need to make sure the path actually resolves 
when invoking the javadoc task. This means you have to prepare the very same 
folder structure you'd like to publish at the point of executing javadoc.

Still, we can create absolute URL links with the "offline" option if the 
relative path is not available (Solr has many absolute URL links to Lucene 
javadocs), but I feel it's a bit much for inter-project links...

was (Author: tomoko uchida):
My description of the inter-module links was not good... let me share an 
example to clarify the problem I am bumping into (though I'm not sure such a 
redundant explanation is needed here).

There is a link from {{o.a.l.a.core.KeywordAnalyzer}} to {{o.a.l.a.Analyzer}}. 
{{KeywordAnalyzer}} is placed in the "analysis/common" project and {{Analyzer}} 
is placed in the "core" project, so the link is an inter-module (inter-project) 
link from "analysis/common" to "core".

[https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html]

The link is represented by a relative path: {{}}, which is automatically 
generated by the Ant javadoc task (its nested {{link}} element). The link 
element (without the "offline" option) raises errors when the "href" path or 
URL cannot be resolved, so you need to make sure the path actually resolves 
when invoking the javadoc task. This means you have to prepare the very same 
folder structure you'd like to publish at the point of executing javadoc.

Still, we can create absolute URL links with the "offline" option if the 
relative path is not available (Solr has many absolute URL links to Lucene 
javadocs), but I feel it's a bit much for inter-project links...

> Gradle Javadoc task should output the same documents as Ant
> ---
>
> Key: LUCENE-9242
> URL: https://issues.apache.org/jira/browse/LUCENE-9242
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: general/javadocs
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> "javadoc" task for the Gradle build does not correctly output package 
> summaries, since it ignores "package.html" file in the source tree (so the 
> Python linter {{checkJavaDocs.py}} detects that and fails for now.)
> Also the "javadoc" task should make inter-module links just as Ant build does.
> See for more details: LUCENE-9201



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9242) Gradle Javadoc task should output the same documents as Ant

2020-03-02 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049160#comment-17049160
 ] 

Tomoko Uchida commented on LUCENE-9242:
---

My description of the inter-module links was not good... let me share an 
example to clarify the problem I am bumping into (though I'm not sure such a 
redundant explanation is needed here).

There is a link from {{o.a.l.a.core.KeywordAnalyzer}} to {{o.a.l.a.Analyzer}}. 
{{KeywordAnalyzer}} is placed in the "analysis/common" project and {{Analyzer}} 
is placed in the "core" project, so the link is an inter-module (inter-project) 
link from "analysis/common" to "core".

[https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html]

The link is represented by a relative path: {{}}, which is automatically 
generated by the Ant javadoc task (its nested {{link}} element). The link 
element (without the "offline" option) raises errors when the "href" path or 
URL cannot be resolved, so you need to make sure the path actually resolves 
when invoking the javadoc task. This means you have to prepare the very same 
folder structure you'd like to publish at the point of executing javadoc.

Still, we can create absolute URL links with the "offline" option if the 
relative path is not available (Solr has many absolute URL links to Lucene 
javadocs), but I feel it's a bit much for inter-project links...
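For context, the Ant-side mechanism being described is roughly this (a hedged 
sketch using the stock Ant javadoc task and its nested link element; paths are 
illustrative):

{code:xml}
<javadoc destdir="build/docs/analyzers-common"
         sourcepath="src/java"
         packagenames="org.apache.lucene.analysis.*">
  <!-- relative inter-module link: fails unless ../core is already built -->
  <link href="../core/"/>
  <!-- offline variant: absolute URL, resolved against a local package-list -->
  <link offline="true"
        href="https://lucene.apache.org/core/8_4_0/core/"
        packagelistLoc="build/docs/core"/>
</javadoc>
{code}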

> Gradle Javadoc task should output the same documents as Ant
> ---
>
> Key: LUCENE-9242
> URL: https://issues.apache.org/jira/browse/LUCENE-9242
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: general/javadocs
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> "javadoc" task for the Gradle build does not correctly output package 
> summaries, since it ignores "package.html" file in the source tree (so the 
> Python linter {{checkJavaDocs.py}} detects that and fails for now.)
> Also the "javadoc" task should make inter-module links just as Ant build does.
> See for more details: LUCENE-9201



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9259) NGramFilter use wrong argument name for preserve option

2020-03-03 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17050263#comment-17050263
 ] 

Tomoko Uchida commented on LUCENE-9259:
---

[~Paul Pazderski] thanks, good catch. The fix looks good to me (it preserves 
backward compatibility); let me check the tests and documentation.
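To make the inconsistency concrete, a hedged before/after of the factory 
configuration (argument names per the issue text and patch; untested):

{code:xml}
<!-- EdgeNGram: the documented argument works as advertised -->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="5"
        preserveOriginal="true"/>

<!-- NGram before the fix: only the initially planned name is honored -->
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="3"
        keepShortTerms="true"/>

<!-- NGram after the fix: accepts preserveOriginal, matching the documentation -->
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="3"
        preserveOriginal="true"/>
{code}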

> NGramFilter use wrong argument name for preserve option
> ---
>
> Key: LUCENE-9259
> URL: https://issues.apache.org/jira/browse/LUCENE-9259
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 7.4, 8.0
>Reporter: Paul Pazderski
>Priority: Minor
> Attachments: LUCENE-9259.patch
>
>
> LUCENE-7960 added the possibility to preserve the original term when using 
> NGram filters. The documentation says to enable it with 'preserveOriginal', 
> and that works for the EdgeNGram filter. But the NGram filter requires the 
> initially planned option 'keepShortTerms' to enable this feature.
> This inconsistency is confusing. I'll provide a patch with a possible fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-11746) numeric fields need better error handling for prefix/wildcard syntax -- consider uniform support for "foo:* == foo:[* TO *]"

2020-03-03 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049991#comment-17049991
 ] 

Tomoko Uchida commented on SOLR-11746:
--

I did a clean checkout from the Apache repo, and the build still doesn't work 
for me.

{code}
git clone git@github.com:apache/lucene-solr.git
cd lucene-solr/
ant -f solr/solr-ref-guide/build.xml build-site
...
build-site:
 [java] Relative link points at id that doesn't exist in dest: 
the-standard-query-parser.html#differences-between-lucenes-classic-query-parser-and-solrs-standard-query-parser
 [java]  ... source: 
file:/mnt/hdd/tmp/lucene-solr/solr/build/solr-ref-guide/html-site/common-query-parameters.html
 [java] Relative link points at id that doesn't exist in dest: 
#differences-between-lucenes-classic-query-parser-and-solrs-standard-query-parser
 [java]  ... source: 
file:/mnt/hdd/tmp/lucene-solr/solr/build/solr-ref-guide/html-site/the-standard-query-parser.html
 [java] Processed 2610 links (1930 relative) to 3728 anchors in 262 files
 [java] Total of 2 problems found

BUILD FAILED
{code}

I have not looked at the details yet, but the problem might occur only for me 
(caused by my setup?).

> numeric fields need better error handling for prefix/wildcard syntax -- 
> consider uniform support for "foo:* == foo:[* TO *]"
> 
>
> Key: SOLR-11746
> URL: https://issues.apache.org/jira/browse/SOLR-11746
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.0
>Reporter: Chris M. Hostetter
>Assignee: Houston Putman
>Priority: Major
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, 
> SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, 
> SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch
>
>
> On the solr-user mailing list, Torsten Krah pointed out that with Trie 
> numeric fields, query syntax such as {{foo_d:\*}} has been functionally 
> equivalent to {{foo_d:\[\* TO \*]}}, and asked why this was not also 
> supported for Point based numeric fields.
> The fact that this type of syntax works (for {{indexed="true"}} Trie fields) 
> appears to have been an (untested, undocumented) fluke of Trie fields, given 
> that they use indexed terms for the (encoded) numeric terms and inherit the 
> default implementation of {{FieldType.getPrefixQuery}}, which produces a 
> prefix query against the {{""}} (empty string) term (see the sketch after 
> this description).  
> (Note that this syntax has apparently _*never*_ worked for Trie fields with 
> {{indexed="false" docValues="true"}}.)
> In general, we should assess the behavior when users attempt prefix/wildcard 
> syntax queries against numeric fields, as the current behavior is largely 
> nonsensical: prefix/wildcard syntax frequently matches no docs without any 
> sort of error, and the aforementioned {{numeric_field:*}} behaves 
> inconsistently between points/trie fields and between indexed/docValued trie 
> fields.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-26 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023757#comment-17023757
 ] 

Tomoko Uchida commented on LUCENE-9123:
---

I opened an issue for the SynonymGraphFilter: LUCENE-9173. I also found an 
issue about multi-word synonyms, LUCENE-8137; it seems to be different from 
the issue discussed here (but I'm not fully sure of that).

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.4
>Reporter: Kazuaki Hiraga
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_8x.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple tokens 
> as output. If we use `mode=normal`, it should be fine. However, we would 
> like to use decomposed tokens that can maximize the chance to increase 
> recall.
> Snippet of schema:
> {code:xml}
> <fieldType name="text_ja" class="solr.TextField"
>  positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="lang/synonyms_ja.txt"
>             tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>     <filter class="solr.JapaneseBaseFormFilterFactory"/>
>     <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
>     <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> A synonym entry that generates an error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is the output on the console:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9173) SynonymGraphFilter doesn't correctly consume decompounded tokens (branched token graph)

2020-01-26 Thread Tomoko Uchida (Jira)
Tomoko Uchida created LUCENE-9173:
-

 Summary: SynonymGraphFilter doesn't correctly consume decompounded 
tokens  (branched token graph)
 Key: LUCENE-9173
 URL: https://issues.apache.org/jira/browse/LUCENE-9173
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/analysis
Reporter: Tomoko Uchida


This is a derived issue from LUCENE-9123.

When the tokenizer that is given to SynonymGraphFilter decompounds tokens or 
emits multiple tokens at the same position, SynonymGraphFilter cannot 
correctly handle them (an exception will be thrown).

For example, JapaneseTokenizer (mode=SEARCH) would emit a token and two 
decompounded tokens for the text "株式会社":
{code:java}
株式会社 (positionIncrement=0, positionLength=2)
株式 (positionIncrement=1, positionLength=1)
会社 (positionIncrement=1, positionLength=1)
{code}
Then if we give the synonym "株式会社,コーポレーション" to SynonymGraphFilter (with 
tokenizerFactory=JapaneseTokenizerFactory), this exception is thrown.
{code:java}
Caused by: java.lang.IllegalArgumentException: term: 株式会社 analyzed to a token 
(株式会社) with position increment != 1 (got: 0)
at 
org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:325)
 ~[lucene-analyzers-common-8.4.0.jar:8.4.0 
bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
at 
org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
 ~[lucene-analyzers-common-8.4.0.jar:8.4.0 
bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
at 
org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
 ~[lucene-analyzers-common-8.4.0.jar:8.4.0 
bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
at 
org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.loadSynonyms(SynonymGraphFilterFactory.java:179)
 ~[lucene-analyzers-common-8.4.0.jar:8.4.0 
bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
at 
org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.inform(SynonymGraphFilterFactory.java:154)
 ~[lucene-analyzers-common-8.4.0.jar:8.4.0 
bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
{code}
This isn't limited to JapaneseTokenizer but is a more general issue about 
handling branched token graphs (decompounded tokens midstream).
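A minimal way to observe the token graph above (a sketch against the 8.x 
kuromoji and analysis APIs; output ordering may differ slightly):

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

public class PrintTokenGraph {
  public static void main(String[] args) throws Exception {
    // no user dictionary, discard punctuation, decompounding SEARCH mode
    JapaneseTokenizer tok = new JapaneseTokenizer(null, true, Mode.SEARCH);
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posInc = tok.addAttribute(PositionIncrementAttribute.class);
    PositionLengthAttribute posLen = tok.addAttribute(PositionLengthAttribute.class);

    tok.setReader(new StringReader("株式会社"));
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(term + " (positionIncrement=" + posInc.getPositionIncrement()
          + ", positionLength=" + posLen.getPositionLength() + ")");
    }
    tok.end();
    tok.close();
  }
}
{code}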



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-26 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023778#comment-17023778
 ] 

Tomoko Uchida commented on LUCENE-9123:
---

When reproducing this issue I noticed that JapaneseTokenizer (mode=search) 
gives positionIncrement=1 for the decompounded token "株式" instead of 0. This 
looks strange to me; is this expected behaviour? If not, it may affect the 
synonym handling.

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.4
>Reporter: Kazuaki Hiraga
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_8x.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple tokens 
> as output. If we use `mode=normal`, it should be fine. However, we would 
> like to use decomposed tokens that can maximize the chance to increase 
> recall.
> Snippet of schema:
> {code:xml}
> <fieldType name="text_ja" class="solr.TextField"
>  positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="lang/synonyms_ja.txt"
>             tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>     <filter class="solr.JapaneseBaseFormFilterFactory"/>
>     <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
>     <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> A synonym entry that generates an error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is the output on the console:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-26 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023863#comment-17023863
 ] 

Tomoko Uchida commented on LUCENE-9123:
---

Thanks [~h.kazuaki] for updating the patches. +1, I will commit them with 
CHANGES and MIGRATE entries next weekend or so (sorry for the delay; I may not 
have time to test them locally right now). Meanwhile, can you tell us the 
e-mail address that should be logged as the author of the patch?

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.4
>Reporter: Kazuaki Hiraga
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_8x.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple tokens 
> as output. If we use `mode=normal`, it should be fine. However, we would 
> like to use decomposed tokens that can maximize the chance to increase 
> recall.
> Snippet of schema:
> {code:xml}
> <fieldType name="text_ja" class="solr.TextField"
>  positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="lang/synonyms_ja.txt"
>             tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>     <filter class="solr.JapaneseBaseFormFilterFactory"/>
>     <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
>     <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> A synonym entry that generates an error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is the output on the console:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9173) SynonymGraphFilter doesn't correctly consume decompounded tokens (branched token graph)

2020-01-26 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-9173:
--
Description: 
This is a derived issue from LUCENE-9123.

When the tokenizer that is given to SynonymGraphFilter decompounds tokens or 
emits multiple tokens at the same position, SynonymGraphFilter cannot 
correctly handle them (an exception will be thrown).

For example, JapaneseTokenizer (mode=SEARCH) would emit a token and two 
decompounded tokens for the text "株式会社":
{code:java}
株式会社 (positionIncrement=0, positionLength=2)
株式 (positionIncrement=1, positionLength=1)
会社 (positionIncrement=1, positionLength=1)
{code}
Then if we give the synonym "株式会社,コーポレーション" to SynonymGraphFilterFactory (with 
tokenizerFactory=JapaneseTokenizerFactory), this exception is thrown.
{code:java}
Caused by: java.lang.IllegalArgumentException: term: 株式会社 analyzed to a token 
(株式会社) with position increment != 1 (got: 0)
at 
org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:325)
 ~[lucene-analyzers-common-8.4.0.jar:8.4.0 
bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
at 
org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
 ~[lucene-analyzers-common-8.4.0.jar:8.4.0 
bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
at 
org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
 ~[lucene-analyzers-common-8.4.0.jar:8.4.0 
bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
at 
org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.loadSynonyms(SynonymGraphFilterFactory.java:179)
 ~[lucene-analyzers-common-8.4.0.jar:8.4.0 
bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
at 
org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.inform(SynonymGraphFilterFactory.java:154)
 ~[lucene-analyzers-common-8.4.0.jar:8.4.0 
bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
{code}
This isn't limited to JapaneseTokenizer but is a more general issue about 
handling branched token graphs (decompounded tokens midstream).

  was:
This is a derived issue from LUCENE-9123.

When the tokenizer that is given to SynonymGraphFilter decompounds tokens or 
emits multiple tokens at the same position, SynonymGraphFilter cannot 
correctly handle them (an exception will be thrown).

For example, JapaneseTokenizer (mode=SEARCH) would emit a token and two 
decompounded tokens for the text "株式会社":
{code:java}
株式会社 (positionIncrement=0, positionLength=2)
株式 (positionIncrement=1, positionLength=1)
会社 (positionIncrement=1, positionLength=1)
{code}
Then if we give the synonym "株式会社,コーポレーション" to SynonymGraphFilter (with 
tokenizerFactory=JapaneseTokenizerFactory), this exception is thrown.
{code:java}
Caused by: java.lang.IllegalArgumentException: term: 株式会社 analyzed to a token 
(株式会社) with position increment != 1 (got: 0)
at 
org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:325)
 ~[lucene-analyzers-common-8.4.0.jar:8.4.0 
bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
at 
org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
 ~[lucene-analyzers-common-8.4.0.jar:8.4.0 
bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
at 
org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
 ~[lucene-analyzers-common-8.4.0.jar:8.4.0 
bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
at 
org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.loadSynonyms(SynonymGraphFilterFactory.java:179)
 ~[lucene-analyzers-common-8.4.0.jar:8.4.0 
bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
at 
org.apache.lucene.analysis.synonym.SynonymGraphFilterFactory.inform(SynonymGraphFilterFactory.java:154)
 ~[lucene-analyzers-common-8.4.0.jar:8.4.0 
bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:38]
{code}
This isn't limited to JapaneseTokenizer but is a more general issue about 
handling branched token graphs (decompounded tokens midstream).


> SynonymGraphFilter doesn't correctly consume decompounded tokens  (branched 
> token graph)
> 
>
> Key: LUCENE-9173
> URL: https://issues.apache.org/jira/browse/LUCENE-9173
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Minor
>
> This is a derived issue from LUCENE-9123.
> When the tokenizer that is given to SynonymGraphFilter decompounds tokens or 
> emits multiple tokens at the same position, SynonymGraphFilter cannot 
> correctly handle them (an exception will be thrown).

[jira] [Comment Edited] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-26 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023778#comment-17023778
 ] 

Tomoko Uchida edited comment on LUCENE-9123 at 1/26/20 6:47 PM:


When reproducing this issue I noticed that JapaneseTokenizer (mode=search) 
gives positionIncrement=1 for the decompounded token "株式" instead of 0. This 
looks strange to me; is this expected behaviour? If not, it may affect the 
synonym handling.

And please ignore my previous comment... I was mistaken about the position 
increment.


was (Author: tomoko uchida):
When reproducing this issue I noticed that JapaneseTokenizer (mode=search) 
gives positionIncrement=1 for the decompounded token "株式" instead of 0. This 
looks strange to me; is this expected behaviour? If not, it may affect the 
synonym handling.

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.4
>Reporter: Kazuaki Hiraga
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_8x.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple tokens 
> as output. If we use `mode=normal`, it should be fine. However, we would 
> like to use decomposed tokens that can maximize the chance to increase 
> recall.
> Snippet of schema:
> {code:xml}
> <fieldType name="text_ja" class="solr.TextField"
>  positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="lang/synonyms_ja.txt"
>             tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>     <filter class="solr.JapaneseBaseFormFilterFactory"/>
>     <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
>     <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> A synonym entry that generates an error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is the output on the console:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-26 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023778#comment-17023778
 ] 

Tomoko Uchida edited comment on LUCENE-9123 at 1/26/20 6:49 PM:


When reproducing this issue I noticed that JapaneseTokenizer (mode=search) 
gives positionIncrement=1 for the decompounded token "株式" instead of 0. This 
looks strange to me; is this expected behaviour? If not, it may affect the 
synonym handling.

Please ignore my comment above... I was mistaken about the position increment.


was (Author: tomoko uchida):
When reproducing this issue I noticed that JapaneseTokenizer (mode=search) 
gives positionIncrement=1 for the decompounded token "株式" instead of 0. This 
looks strange to me; is this expected behaviour? If not, it may affect the 
synonym handling.

And please ignore my previous comment... I was mistaken about the position 
increment.

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.4
>Reporter: Kazuaki Hiraga
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_8x.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple tokens 
> as output. If we use `mode=normal`, it should be fine. However, we would 
> like to use decomposed tokens that can maximize the chance to increase 
> recall.
> Snippet of schema:
> {code:xml}
> <fieldType name="text_ja" class="solr.TextField"
>  positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="lang/synonyms_ja.txt"
>             tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>     <filter class="solr.JapaneseBaseFormFilterFactory"/>
>     <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
>     <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> A synonym entry that generates an error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is the output on the console:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9004) Approximate nearest vector search

2020-02-05 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031275#comment-17031275
 ] 

Tomoko Uchida commented on LUCENE-9004:
---

The context is created on a per-reader basis, not per query. You didn't share 
your test code, but I suspect you open a new IndexReader every time you issue 
a query? I think if you reuse one index reader (index searcher) throughout the 
test, the memory usage stays stable between 2 and 4 GB. 
Anyway, yes, the static cache (for the graph structure) isn't a good 
implementation; that is one reason why I said the HNSW branch is still at a 
pretty early stage... 
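A hedged sketch of the reuse pattern being suggested (core Lucene APIs; the 
knn query call is left as a hypothetical helper since the branch API is still 
in flux):

{code:java}
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ReuseReader {
  public static void main(String[] args) throws Exception {
    try (Directory dir = FSDirectory.open(Paths.get("/path/to/index"));
         DirectoryReader reader = DirectoryReader.open(dir)) {
      // One searcher for the whole test run: per-reader caches (e.g. the graph
      // structure) are built once instead of once per query.
      IndexSearcher searcher = new IndexSearcher(reader);
      for (int i = 0; i < 1_000; i++) {
        // runKnnQuery(searcher, randomVector()); // hypothetical query helper
      }
    }
  }
}
{code}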

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to lookup through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However 
> graph-per-segment is a very natural at search time - we can traverse each 
> segments' graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be  limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: In order to add links bidirectionally we must continually update existing 
> documents.
> I think we want to express this API to users as a single joint 
> {{KnnGraphField}} abstraction that joins together the vectors and the graph 
> as a single joint field type. Mostly it just looks like a vector-valued 
> field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many 
> nocommits, basic design is not really set, there is no Query implementation 
> and no integration with IndexSearcher, but it does work by some measure using 
> a standalone test class. I've tested with uniform random vectors and on my 
> laptop 

[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-06 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031702#comment-17031702
 ] 

Tomoko Uchida commented on LUCENE-9201:
---

[~erickerickson] I added sub-tasks equivalent to the ant targets.
 - -check-broken-links (this internally calls 
{{dev-tools/scripts/checkJavadocLinks.py}})
 - -check-missing-javadocs (this internally calls 
{{dev-tools/scripts/checkJavaDocs.py}} )

And I opened a PR :)

[https://github.com/apache/lucene-solr/pull/1242]

I think this is almost equivalent to Ant's "documentation-lint", with some 
notes below. [~erickerickson] [~dweiss] Could you review it?

*Note:*

For now, the Python linters - {{checkBrokenLinks}}, {{checkMissingJavadocsClass}} 
and {{checkMissingJavadocsMethod}} - will fail because the Gradle-generated 
Javadocs seem to be slightly different from the Ant-generated ones.
 * Javadoc directory structure: "ant documentation" generates an 
"analyzers-common" docs dir for the "analysis/common" module, but "gradlew 
javadoc" generates "analysis/common" for the same module. I think we can 
adjust the structure, but where is the suitable place to do so?
 * Package summary: "ant documentation" uses "package.html" as the package 
summary description, but "gradlew javadoc" ignores "package.html" (so some 
packages lack a summary description in "package-summary.html" when building 
javadocs with Gradle). We might be able to make the Gradle Javadoc task 
properly handle "package.html" files with some options. Or should we replace 
all "package.html" with "package-info.java" at this point? (See the sketch 
below.)

After the Gradle-generated Javadoc is fixed, we can return here and complete 
this sub-task.
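If we go the "package-info.java" route, the conversion is mechanical; a hedged 
example (package name illustrative):

{code:java}
/**
 * Package summary text moved verbatim from the old package.html body.
 *
 * <p>Unlike package.html, the Gradle javadoc task picks this file up without
 * any extra configuration, since it is an ordinary compilation unit.
 */
package org.apache.lucene.analysis.example; // hypothetical package
{code}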

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9077) Gradle build

2020-02-06 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-9077:
--
Attachment: LUCENE-9077-javadoc-locale-en-US.patch

> Gradle build
> 
>
> Key: LUCENE-9077
> URL: https://issues.apache.org/jira/browse/LUCENE-9077
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-9077-javadoc-locale-en-US.patch
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This task focuses on providing gradle-based build equivalent for Lucene and 
> Solr (on master branch). See notes below on why this respin is needed.
> The code lives on the *gradle-master* branch. It is kept in sync with *master*. 
> Try running the following to see an overview of helper guides concerning 
> typical workflow, testing and ant-migration helpers:
> gradlew :help
> A list of items that need to be added or require work. If you'd like to 
> work on any of these, please add your name to the list. Once you have a 
> patch/ pull request let me (dweiss) know - I'll try to coordinate the merges.
>  * (/) Apply forbiddenAPIs
>  * (/) Generate hardware-aware gradle defaults for parallelism (count of 
> workers and test JVMs).
>  * (/) Fail the build if --tests filter is applied and no tests execute 
> during the entire build (this allows for an empty set of filtered tests at 
> single project level).
>  * (/) Port other settings and randomizations from common-build.xml
>  * (/) Configure security policy/ sandboxing for tests.
>  * (/) test's console output on -Ptests.verbose=true
>  * (/) add a :helpDeps explanation to how the dependency system works 
> (palantir plugin, lockfile) and how to retrieve structured information about 
> current dependencies of a given module (in a tree-like output).
>  * (/) jar checksums, jar checksum computation and validation. This should be 
> done without intermediate folders (directly on dependency sets).
>  * (/) verify min. JVM version and exact gradle version on build startup to 
> minimize odd build side-effects
>  * (/) Repro-line for failed tests/ runs.
>  * (/) add a top-level README note about building with gradle (and the 
> required JVM).
>  * (/) add an equivalent of 'validate-source-patterns' 
> (check-source-patterns.groovy) to precommit.
>  * (/) add an equivalent of 'rat-sources' to precommit.
>  * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) 
> to precommit.
>  * (/) javadoc compilation
> Hard-to-implement stuff already investigated:
>  * (/) (done)  -*Printing console output of failed tests.* There doesn't seem 
> to be any way to do this in a reasonably efficient way. There are onOutput 
> listeners but they're slow to operate and solr tests emit *tons* of output so 
> it's an overkill.-
>  * (!) (LUCENE-9120) *Tests working with security-debug logs or other 
> JVM-early log output*. Gradle's test runner works by redirecting Java's 
> stdout/ syserr so this just won't work. Perhaps we can spin the ant-based 
> test runner for such corner-cases.
> Of lesser importance:
>  * Add an equivalent of 'documentation-lint" to precommit.
>  * (/) Do not require files to be committed before running precommit. (staged 
> files are fine).
>  * (/) add rendering of javadocs (gradlew javadoc)
>  * Attach javadocs to maven publications.
>  * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid 
> it'll be difficult to run it sensibly because gradle doesn't offer cwd 
> separation for the forked test runners.
>  * if you diff solr packaged distribution against ant-created distribution 
> there are minor differences in library versions and some JARs are excluded/ 
> moved around. I didn't try to force these as everything seems to work (tests, 
> etc.) – perhaps these differences should  be fixed in the ant build instead.
>  * [EOE] identify and port various "regenerate" tasks from ant builds 
> (javacc, precompiled automata, etc.)
>  * Fill in POM details in gradle/defaults-maven.gradle so that they reflect 
> the previous content better (dependencies aside).
>  * Add any IDE integration layers that should be added (I use IntelliJ and it 
> imports the project out of the box, without the need for any special tuning).
>  * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; 
> currently XSLT...)
>  * I didn't bother adding Solr dist/test-framework to packaging (who'd use it 
> from a binary distribution? 
>  * There is some python execution in check-broken-links and 
> check-missing-javadocs, not sure if it's been ported
>  * Nightly-smoke also have some python execution, not sure of the status.
>  * Precommit doesn't catch unused imports
>  
> *{color:#ff}Note:{color}* this builds on the work 

[jira] [Commented] (LUCENE-9077) Gradle build

2020-02-06 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031732#comment-17031732
 ] 

Tomoko Uchida commented on LUCENE-9077:
---

I found a JDK Javadoc tool related issue which was fixed in the ant build in 
https://issues.apache.org/jira/browse/LUCENE-8738?focusedCommentId=16822659=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16822659.
 I attached the same workaround patch [^LUCENE-9077-javadoc-locale-en-US.patch] 
for the gradle build. Will commit it soon.
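For the record, the gradle-side shape of that workaround is presumably a 
one-liner along these lines (untested sketch, Groovy DSL):

{code}
// force a stable locale so javadoc output does not vary with the build host
allprojects {
  tasks.withType(Javadoc) {
    options.locale = 'en_US'
  }
}
{code}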

 

> Gradle build
> 
>
> Key: LUCENE-9077
> URL: https://issues.apache.org/jira/browse/LUCENE-9077
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-9077-javadoc-locale-en-US.patch
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This task focuses on providing gradle-based build equivalent for Lucene and 
> Solr (on master branch). See notes below on why this respin is needed.
> The code lives on the *gradle-master* branch. It is kept in sync with *master*. 
> Try running the following to see an overview of helper guides concerning 
> typical workflow, testing and ant-migration helpers:
> gradlew :help
> A list of items that need to be added or require work. If you'd like to 
> work on any of these, please add your name to the list. Once you have a 
> patch/ pull request let me (dweiss) know - I'll try to coordinate the merges.
>  * (/) Apply forbiddenAPIs
>  * (/) Generate hardware-aware gradle defaults for parallelism (count of 
> workers and test JVMs).
>  * (/) Fail the build if --tests filter is applied and no tests execute 
> during the entire build (this allows for an empty set of filtered tests at 
> single project level).
>  * (/) Port other settings and randomizations from common-build.xml
>  * (/) Configure security policy/ sandboxing for tests.
>  * (/) test's console output on -Ptests.verbose=true
>  * (/) add a :helpDeps explanation to how the dependency system works 
> (palantir plugin, lockfile) and how to retrieve structured information about 
> current dependencies of a given module (in a tree-like output).
>  * (/) jar checksums, jar checksum computation and validation. This should be 
> done without intermediate folders (directly on dependency sets).
>  * (/) verify min. JVM version and exact gradle version on build startup to 
> minimize odd build side-effects
>  * (/) Repro-line for failed tests/ runs.
>  * (/) add a top-level README note about building with gradle (and the 
> required JVM).
>  * (/) add an equivalent of 'validate-source-patterns' 
> (check-source-patterns.groovy) to precommit.
>  * (/) add an equivalent of 'rat-sources' to precommit.
>  * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) 
> to precommit.
>  * (/) javadoc compilation
> Hard-to-implement stuff already investigated:
>  * (/) (done)  -*Printing console output of failed tests.* There doesn't seem 
> to be any way to do this in a reasonably efficient way. There are onOutput 
> listeners but they're slow to operate and solr tests emit *tons* of output so 
> it's an overkill.-
>  * (!) (LUCENE-9120) *Tests working with security-debug logs or other 
> JVM-early log output*. Gradle's test runner works by redirecting Java's 
> stdout/ syserr so this just won't work. Perhaps we can spin the ant-based 
> test runner for such corner-cases.
> Of lesser importance:
>  * Add an equivalent of 'documentation-lint" to precommit.
>  * (/) Do not require files to be committed before running precommit. (staged 
> files are fine).
>  * (/) add rendering of javadocs (gradlew javadoc)
>  * Attach javadocs to maven publications.
>  * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid 
> it'll be difficult to run it sensibly because gradle doesn't offer cwd 
> separation for the forked test runners.
>  * if you diff solr packaged distribution against ant-created distribution 
> there are minor differences in library versions and some JARs are excluded/ 
> moved around. I didn't try to force these as everything seems to work (tests, 
> etc.) – perhaps these differences should  be fixed in the ant build instead.
>  * [EOE] identify and port various "regenerate" tasks from ant builds 
> (javacc, precompiled automata, etc.)
>  * Fill in POM details in gradle/defaults-maven.gradle so that they reflect 
> the previous content better (dependencies aside).
>  * Add any IDE integration layers that should be added (I use IntelliJ and it 
> imports the project out of the box, without the need for any special tuning).
>  * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; 
> currently XSLT...)
>  * I didn't bother adding Solr dist/test-framework to 

[jira] [Commented] (LUCENE-9077) Gradle build

2020-02-02 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028347#comment-17028347
 ] 

Tomoko Uchida commented on LUCENE-9077:
---

Hi [~dweiss],
{quote}Add an equivalent of 'documentation-lint" to precommit.
{quote}
what's the current status of that?

I just started trying to port the "documentation-lint" task on my local branch: 
[https://github.com/mocobeta/lucene-solr-mirror/commit/7adc390183b10ea1b64fded000a87900853cf912]
 I'm not sure if it is still a significant task for the Gradle build. Would it 
be of any help here?

> Gradle build
> 
>
> Key: LUCENE-9077
> URL: https://issues.apache.org/jira/browse/LUCENE-9077
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This task focuses on providing gradle-based build equivalent for Lucene and 
> Solr (on master branch). See notes below on why this respin is needed.
> The code lives on the *gradle-master* branch. It is kept in sync with *master*. 
> Try running the following to see an overview of helper guides concerning 
> typical workflow, testing and ant-migration helpers:
> gradlew :help
> A list of items that need to be added or require work. If you'd like to 
> work on any of these, please add your name to the list. Once you have a 
> patch/ pull request let me (dweiss) know - I'll try to coordinate the merges.
>  * (/) Apply forbiddenAPIs
>  * (/) Generate hardware-aware gradle defaults for parallelism (count of 
> workers and test JVMs).
>  * (/) Fail the build if --tests filter is applied and no tests execute 
> during the entire build (this allows for an empty set of filtered tests at 
> single project level).
>  * (/) Port other settings and randomizations from common-build.xml
>  * (/) Configure security policy/ sandboxing for tests.
>  * (/) test's console output on -Ptests.verbose=true
>  * (/) add a :helpDeps explanation to how the dependency system works 
> (palantir plugin, lockfile) and how to retrieve structured information about 
> current dependencies of a given module (in a tree-like output).
>  * (/) jar checksums, jar checksum computation and validation. This should be 
> done without intermediate folders (directly on dependency sets).
>  * (/) verify min. JVM version and exact gradle version on build startup to 
> minimize odd build side-effects
>  * (/) Repro-line for failed tests/ runs.
>  * (/) add a top-level README note about building with gradle (and the 
> required JVM).
>  * (/) add an equivalent of 'validate-source-patterns' 
> (check-source-patterns.groovy) to precommit.
>  * (/) add an equivalent of 'rat-sources' to precommit.
>  * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) 
> to precommit.
> * (/) javadoc compilation
> Hard-to-implement stuff already investigated:
>  * (/) (done)  -*Printing console output of failed tests.* There doesn't seem 
> to be any way to do this in a reasonably efficient way. There are onOutput 
> listeners but they're slow to operate and solr tests emit *tons* of output so 
> it's an overkill.-
>  * (!) (LUCENE-9120) *Tests working with security-debug logs or other 
> JVM-early log output*. Gradle's test runner works by redirecting Java's 
> stdout/ syserr so this just won't work. Perhaps we can spin the ant-based 
> test runner for such corner-cases.
> Of lesser importance:
>  * Add an equivalent of 'documentation-lint" to precommit.
>  * (/) Do not require files to be committed before running precommit. (staged 
> files are fine).
>  * (/) add rendering of javadocs (gradlew javadoc)
>  * Attach javadocs to maven publications.
>  * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid 
> it'll be difficult to run it sensibly because gradle doesn't offer cwd 
> separation for the forked test runners.
>  * if you diff solr packaged distribution against ant-created distribution 
> there are minor differences in library versions and some JARs are excluded/ 
> moved around. I didn't try to force these as everything seems to work (tests, 
> etc.) – perhaps these differences should  be fixed in the ant build instead.
>  * [EOE] identify and port various "regenerate" tasks from ant builds 
> (javacc, precompiled automata, etc.)
>  * Fill in POM details in gradle/defaults-maven.gradle so that they reflect 
> the previous content better (dependencies aside).
>  * Add any IDE integration layers that should be added (I use IntelliJ and it 
> imports the project out of the box, without the need for any special tuning).
>  * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; 
> currently XSLT...)
>  * I didn't bother adding Solr dist/test-framework to packaging (who'd use it 
> from a binary 

[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-07 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032601#comment-17032601
 ] 

Tomoko Uchida commented on LUCENE-9201:
---

Thank you [~rcmuir] for your work and comments.

I updated the PR (refactored the gradle tasks and ported the ant build details 
as much as I could). I hope it is a good starting point, if not perfect. Some 
of the ant scripts' hacks are still not ported, especially the "ecj-macro" 
stuff, which I cannot figure out how to port to gradle.

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9004) Approximate nearest vector search

2020-02-04 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029841#comment-17029841
 ] 

Tomoko Uchida commented on LUCENE-9004:
---

{quote}
Unfortunately, I couldn't obtain the corresponding results of HNSW due to the 
out of memory error in my PC.
{quote} 

The current HNSW implementation requires a 4GB heap for a 1M-vector / 200-dimension 
dataset (we need to reduce the memory consumption). The default heap size given to 
Java processes depends on the platform, but on most commodity PCs it won't be that 
large, so you will see an OOM unless you set the -Xmx JVM arg.
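
For example, to give the JVM an explicit 4GB heap when running a standalone test 
class (the class name here is only a placeholder):
{noformat}
java -Xmx4g -cp <classpath> org.example.KnnGraphTest
{noformat}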




> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to lookup through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However 
> graph-per-segment is very natural at search time - we can traverse each 
> segments' graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: In order to add links bidirectionally we must continually update existing 
> documents.
> I think we want to express this API to users as a single joint 
> {{KnnGraphField}} abstraction that joins together the vectors and the graph 
> as a single joint field type. Mostly it just looks like a vector-valued 
> field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many 
> nocommits, basic design is not really set, there is no Query implementation 
> and no integration with IndexSearcher, but it does work by some measure using 
> a standalone test class. I've tested with uniform random vectors and on my 
> laptop indexed 

[jira] [Created] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-02 Thread Tomoko Uchida (Jira)
Tomoko Uchida created LUCENE-9201:
-

 Summary: Port documentation-lint task to Gradle build
 Key: LUCENE-9201
 URL: https://issues.apache.org/jira/browse/LUCENE-9201
 Project: Lucene - Core
  Issue Type: Sub-task
Affects Versions: master (9.0)
Reporter: Tomoko Uchida
Assignee: Tomoko Uchida


Ant build's "documentation-lint" target consists of those two sub targets.

- "-ecj-javadoc-lint" (Javadoc linting by ECJ)
- "-documentation-lint"(Missing javadocs / broken links check by python scripts)





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-02 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-9201:
--
Description: 
Ant build's "documentation-lint" target consists of those two sub targets.
 * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
 * "-documentation-lint"(Missing javadocs / broken links check by python 
scripts)

  was:
Ant build's "documentation-lint" target consists of those two sub targets.

- "-ecj-javadoc-lint" (Javadoc linting by ECJ)
- "-documentation-lint"(Missing javadocs / broken links check by python scripts)




> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9077) Gradle build

2020-02-02 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028665#comment-17028665
 ] 

Tomoko Uchida commented on LUCENE-9077:
---

[~erickerickson] thanks for your comment. I opened a sub-task: LUCENE-9201

 

> Gradle build
> 
>
> Key: LUCENE-9077
> URL: https://issues.apache.org/jira/browse/LUCENE-9077
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This task focuses on providing gradle-based build equivalent for Lucene and 
> Solr (on master branch). See notes below on why this respin is needed.
> The code lives on *gradle-master* branch. It is kept with sync with *master*. 
> Try running the following to see an overview of helper guides concerning 
> typical workflow, testing and ant-migration helpers:
> gradlew :help
> A list of items that needs to be added or requires work. If you'd like to 
> work on any of these, please add your name to the list. Once you have a 
> patch/ pull request let me (dweiss) know - I'll try to coordinate the merges.
>  * (/) Apply forbiddenAPIs
>  * (/) Generate hardware-aware gradle defaults for parallelism (count of 
> workers and test JVMs).
>  * (/) Fail the build if --tests filter is applied and no tests execute 
> during the entire build (this allows for an empty set of filtered tests at 
> single project level).
>  * (/) Port other settings and randomizations from common-build.xml
>  * (/) Configure security policy/ sandboxing for tests.
>  * (/) test's console output on -Ptests.verbose=true
>  * (/) add a :helpDeps explanation to how the dependency system works 
> (palantir plugin, lockfile) and how to retrieve structured information about 
> current dependencies of a given module (in a tree-like output).
>  * (/) jar checksums, jar checksum computation and validation. This should be 
> done without intermediate folders (directly on dependency sets).
>  * (/) verify min. JVM version and exact gradle version on build startup to 
> minimize odd build side-effects
>  * (/) Repro-line for failed tests/ runs.
>  * (/) add a top-level README note about building with gradle (and the 
> required JVM).
>  * (/) add an equivalent of 'validate-source-patterns' 
> (check-source-patterns.groovy) to precommit.
>  * (/) add an equivalent of 'rat-sources' to precommit.
>  * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) 
> to precommit.
> * (/) javadoc compilation
> Hard-to-implement stuff already investigated:
>  * (/) (done)  -*Printing console output of failed tests.* There doesn't seem 
> to be any way to do this in a reasonably efficient way. There are onOutput 
> listeners but they're slow to operate and solr tests emit *tons* of output so 
> it's overkill.-
>  * (!) (LUCENE-9120) *Tests working with security-debug logs or other 
> JVM-early log output*. Gradle's test runner works by redirecting Java's 
> stdout/ syserr so this just won't work. Perhaps we can spin the ant-based 
> test runner for such corner-cases.
> Of lesser importance:
>  * Add an equivalent of 'documentation-lint" to precommit.
>  * (/) Do not require files to be committed before running precommit. (staged 
> files are fine).
>  * (/) add rendering of javadocs (gradlew javadoc)
>  * Attach javadocs to maven publications.
>  * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid 
> it'll be difficult to run it sensibly because gradle doesn't offer cwd 
> separation for the forked test runners.
>  * if you diff solr packaged distribution against ant-created distribution 
> there are minor differences in library versions and some JARs are excluded/ 
> moved around. I didn't try to force these as everything seems to work (tests, 
> etc.) – perhaps these differences should be fixed in the ant build instead.
>  * [EOE] identify and port various "regenerate" tasks from ant builds 
> (javacc, precompiled automata, etc.)
>  * Fill in POM details in gradle/defaults-maven.gradle so that they reflect 
> the previous content better (dependencies aside).
>  * Add any IDE integration layers that should be added (I use IntelliJ and it 
> imports the project out of the box, without the need for any special tuning).
>  * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; 
> currently XSLT...)
>  * I didn't bother adding Solr dist/test-framework to packaging (who'd use it 
> from a binary distribution? 
>  
> *{color:#ff}Note:{color}* this builds on the work done by Mark Miller and 
> Cao Mạnh Đạt but also applies lessons learned from those two efforts:
>  * *Do not try to do too many things at once*. If we deviate too far from 
> master, the branch will be hard to merge.
>  * *Do everything 

[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-02 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028662#comment-17028662
 ] 

Tomoko Uchida commented on LUCENE-9201:
---

This is the WIP branch:
 
[https://github.com/mocobeta/lucene-solr-mirror/commit/7adc390183b10ea1b64fded000a87900853cf912]

To get started I'm trying to port the ECJ task. [The 
compiler|https://help.eclipse.org/2019-03/index.jsp?topic=%2Forg.eclipse.jdt.doc.user%2Ftasks%2Ftask-using_batch_compiler.htm]
 seems to work, retrieving dependencies from each sub-project's 
"configurations", but WARNINGS are suppressed (the Ant build outputs a lot of 
"ecj-lint" warnings). I tweaked the Gradle logger level for STDOUT, but that 
didn't work for me.
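
For reference, the rough shape of what I'm experimenting with - a minimal sketch 
assuming the java plugin is applied, with a dedicated "ecjDeps" configuration and 
a prefs file (both names are mine, and the ECJ version is only an example):
{code:groovy}
configurations { ecjDeps }
dependencies { ecjDeps "org.eclipse.jdt:ecj:3.19.0" }

task ecjLint(type: JavaExec) {
  classpath = configurations.ecjDeps
  main = "org.eclipse.jdt.internal.compiler.batch.Main"
  // -d none: lint only, do not emit class files
  args "-d", "none",
       "-encoding", "UTF-8",
       "-properties", file("ecj.javadocs.prefs").absolutePath,
       "-classpath", sourceSets.main.compileClasspath.asPath
  // pass the source directories to be linted
  args sourceSets.main.java.srcDirs.findAll { it.exists() }
}
{code}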

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-31 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-9123:
--
Fix Version/s: 8.5
   master (9.0)

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Kazuaki Hiraga
>Assignee: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0), 8.5
>
> Attachments: LUCENE-9123.patch, LUCENE-9123_8x.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple 
> tokens as output. If we use `mode=normal`, it should be fine. However, we 
> would like to use decomposed tokens to maximize the chance of increasing 
> recall.
> Snippet of schema:
> {code:xml}
> <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="lang/synonyms_ja.txt"
>             tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>     <filter class="solr.JapaneseBaseFormFilterFactory"/>
>     <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt" />
>     <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> A synonym entry that generates the error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is the console output:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-31 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida resolved LUCENE-9123.
---
Resolution: Fixed

Merged the patches into the master (with a MIGRATE notice) and branch_8x.
Thanks [~h.kazuaki]!

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Kazuaki Hiraga
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_8x.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple 
> tokens as output. If we use `mode=normal`, it should be fine. However, we 
> would like to use decomposed tokens to maximize the chance of increasing 
> recall.
> Snippet of schema:
> {code:xml}
> <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="lang/synonyms_ja.txt"
>             tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>     <filter class="solr.JapaneseBaseFormFilterFactory"/>
>     <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt" />
>     <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> A synonym entry that generates the error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is the console output:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-31 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-9123:
--
Affects Version/s: (was: 8.4)

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Kazuaki Hiraga
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_8x.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple 
> tokens as output. If we use `mode=normal`, it should be fine. However, we 
> would like to use decomposed tokens to maximize the chance of increasing 
> recall.
> Snippet of schema:
> {code:xml}
> <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="lang/synonyms_ja.txt"
>             tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>     <filter class="solr.JapaneseBaseFormFilterFactory"/>
>     <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt" />
>     <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> A synonym entry that generates the error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is the console output:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-20 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041159#comment-17041159
 ] 

Tomoko Uchida commented on LUCENE-9201:
---

Hi,

there remain two documentation-lint tasks to be ported. I opened a pull request 
for the easy one - the "check missing javadocs" task, which can be defined on 
each sub project.
 [https://github.com/apache/lucene-solr/pull/1267]

This is functionally the same as my previous pull request, but I rewrote it in a 
bit more declarative manner. It depends on Gradle's default Javadoc task for now; 
I think the basic logic can still apply when we switch to our custom javadoc 
task. Could you review it or give some comments on this?
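
The rough shape of the task is something like this - a simplified sketch, not 
the exact PR code (the "package" level argument is only an example):
{code:groovy}
allprojects {
  plugins.withType(JavaPlugin) {
    tasks.register("checkMissingDocs", Exec) {
      dependsOn tasks.named("javadoc")
      // run the existing python linter over this project's javadoc output
      commandLine "python3",
          rootProject.file("dev-tools/scripts/checkJavaDocs.py").absolutePath,
          tasks.javadoc.destinationDir.absolutePath,
          "package"   // lint level
    }
  }
}
{code}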

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9201-ecj-2.patch, LUCENE-9201-ecj.patch, 
> javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-20 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041449#comment-17041449
 ] 

Tomoko Uchida commented on LUCENE-9201:
---

Hi Dawid, thanks for your comments! I think I get the points; I will update the 
merge request and notify you.

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9201-ecj-2.patch, LUCENE-9201-ecj.patch, 
> javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9155) Port Kuromoji dictionary compilation (regenerate)

2020-02-20 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041549#comment-17041549
 ] 

Tomoko Uchida commented on LUCENE-9155:
---

Actually, 'naist' here is not a dictionary format/type; it's all about the data 
(word entries and the language model). So technically it should be buildable 
with the same logic as the 'ipadic' (default).

I agree that the current ant script for switching dictionary sources isn't 
great. Also, the 'naist' dictionary is no longer widely used for practical 
purposes as far as I know, though it has historical significance in academia. I 
think we can skip the 'naist' dictionary itself for now and focus on providing a 
better way to use/build alternative dictionaries (along with LUCENE-8816, which 
I would like to restart when we complete the migration to the gradle build).

> Port Kuromoji dictionary compilation (regenerate)
> -
>
> Key: LUCENE-9155
> URL: https://issues.apache.org/jira/browse/LUCENE-9155
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
> Attachments: kuromoji.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9155) Port Kuromoji dictionary compilation (regenerate)

2020-02-20 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041571#comment-17041571
 ] 

Tomoko Uchida commented on LUCENE-9155:
---

If we drop/skip the build for the 'naist' dictionary for the next major 
release, I think we have to mention it in CHANGES, though I am not sure 
anyone cares about it...

> Port Kuromoji dictionary compilation (regenerate)
> -
>
> Key: LUCENE-9155
> URL: https://issues.apache.org/jira/browse/LUCENE-9155
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
> Attachments: kuromoji.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-11 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034462#comment-17034462
 ] 

Tomoko Uchida edited comment on LUCENE-9201 at 2/11/20 2:00 PM:


Just out of curiosity, I roughly assessed the penalty of "copying all javadocs" 
on my PC (Core i7-8700 CPU, Fedora).

// when the javadocs are already collected into one directory; just run the 
python linter
{code:java}
 $ ./gradlew checkBrokenLinks
 BUILD FAILED in 29s
{code}
// first collect the javadocs into a directory on an HDD (7200 rpm), then run 
the python linter
{code:java}
 $ ./gradlew checkBrokenLinks
 BUILD FAILED in 37s
{code}
When I did the same thing on an NVMe SSD, there was almost no penalty, for what 
it's worth.

(note: "BUILD FAILED" is an expected result for now.)


was (Author: tomoko uchida):
Just out of curiosity, I roughly assessed the penalty of "copying all javadocs" 
on my PC (Core i7-8700 CPU, Fedora).

// when the javadocs are already collected into one directory; just run the 
python linter
{code:java}
 $ ./gradlew checkBrokenLinks
 BUILD FAILED in 29s
{code}
// first collect the javadocs into a directory on an HDD (7200 rpm), then run 
the python linter
{code:java}
 $ ./gradlew checkBrokenLinks
 BUILD FAILED in 37s
{code}
When I did the same thing on an NVMe SSD, there was almost no penalty, for what 
it's worth.

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-11 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034462#comment-17034462
 ] 

Tomoko Uchida commented on LUCENE-9201:
---

Just out of curiosity, I roughly assessed the penalty of "copying all javadocs" 
on my PC (Core i7-8700 CPU, Fedora).

// when the javadocs are already collected into one directory; just run the 
python linter
{code:java}
 $ ./gradlew checkBrokenLinks
 BUILD FAILED in 29s
{code}
// first collect the javadocs into a directory on an HDD (7200 rpm), then run 
the python linter
{code:java}
 $ ./gradlew checkBrokenLinks
 BUILD FAILED in 37s
{code}
When I did the same thing on an NVMe SSD, there was almost no penalty, for what 
it's worth.
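
For reference, the "collect" step in the second measurement is essentially just 
this (a sketch; the task and directory names are mine, not the real ones):
{code:groovy}
task collectJavadocs(type: Sync) {
  subprojects.each { p ->
    p.plugins.withType(JavaPlugin) {
      // copying from a task copies its outputs and adds the task dependency
      from(p.tasks.javadoc) { into p.name }
    }
  }
  into "${buildDir}/alljavadocs"
}
{code}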

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-11 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034375#comment-17034375
 ] 

Tomoko Uchida commented on LUCENE-9201:
---

{quote}It is my personal preference to have a project-scope granularity. This 
way you can run project-scoped task (like gradlew -p lucene/core javadoc). My 
personal take on assembling "distributions" is to have a separate project that 
just takes what it needs from other projects and puts it together (with any 
tweaks required). This makes it easier to reason about how a distribution is 
assembled and from where, while each project just takes care of itself.
{quote}
I get it, and have no objection. Let's keep the javadocs output as is and gather 
them when needed. We might want an independent task to gather all javadocs, 
though for now I did it in the "checking broken links" task.
{quote}Let me look at the patch again later today (digging myself out of the 
vacation hole).
{quote}
Thank you, I updated the PR according to the comments from Uwe Schindler. I also 
added "// FIXME" comments to some code that doesn't work well or needs to be fixed.

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-17 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038257#comment-17038257
 ] 

Tomoko Uchida commented on LUCENE-9201:
---

Thanks, I got the current status. I will try to create a patch for 
[LUCENE-9219].

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9201-ecj.patch, javadocGRADLE.png, 
> javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-16 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037788#comment-17037788
 ] 

Tomoko Uchida commented on LUCENE-9201:
---

[~dweiss] Thank you for your guidance and detailed comments in the patch!

I applied this [^LUCENE-9201-ecj.patch] to my branch and ran the {{ecjLint}} 
task, then found that the task seemed to finish in one second without executing 
ECJ's Main class. (I intentionally added an unused import to o.a.l.a.Analyzer, 
but it wasn't detected.)
{code:java}
$ ./gradlew :lucene:core:ecjLint
BUILD SUCCESSFUL in 1s
4 actionable tasks: 4 up-to-date
{code}
I have not found any problems in the task definition (and not yet studied 
deeply how it works), but am I missing something?

With my previous patch [https://github.com/apache/lucene-solr/pull/1242/files], 
for example, {{:lucene:core:ecjLint}} takes about five seconds and (correctly) 
fails with this message if there are unused imports.
{code:java}
$ ./gradlew :lucene:core:ecjLint
> Task :lucene:core:ecjLint
--
1. ERROR in 
/mnt/hdd/repo/lucene-solr-mirror/lucene/core/src/java/org/apache/lucene/analysis/Analyzer.java
 (at line 27)
import java.util.ArrayList;
   ^^^
The import java.util.ArrayList is never used
--
1 problem (1 error)

> Task :lucene:core:ecjLint FAILED
FAILURE: Build failed with an exception.

* Where:
Script 
'/mnt/hdd/repo/lucene-solr-mirror/gradle/validation/documentation-lint.gradle' 
line: 130
* What went wrong:
Execution failed for task ':lucene:core:ecjLint'.
> Process 'command '/usr/local/java/adoptopenjdk/jdk-11.0.3+7/bin/java'' 
> finished with non-zero exit value 255
* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug 
option to get more log output. Run with --scan to get full insights.
* Get more help at https://help.gradle.org

BUILD FAILED in 4s
{code}
I will look a bit more closely at why the ecjLint task doesn't work for me...
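
(For what it's worth, I will start with Gradle's own switches to see why the 
task is considered up-to-date:)
{noformat}
$ ./gradlew :lucene:core:ecjLint --rerun-tasks --info
{noformat}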

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9201-ecj.patch, javadocGRADLE.png, 
> javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-23 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042900#comment-17042900
 ] 

Tomoko Uchida commented on LUCENE-9201:
---

[~dweiss] I did some tests with the patch [^LUCENE-9201.patch] and noticed it 
seems to cause an error with the :solr:server project, where the {{java-library}} 
plugin is applied but there is actually no Java source - hence no Javadoc 
folder.
{code:java}
lucene-solr $ ./gradlew :solr:server:checkMissingDocs
> Task :solr:server:checkMissingDocsDefault FAILED

FAILURE: Build failed with an exception.

* Where:
Script '/mnt/hdd/repo/lucene-solr/gradle/validation/missing-docs-check.gradle' 
line: 105

* What went wrong:
Execution failed for task ':solr:server:checkMissingDocsDefault'.
> Javadoc verification failed:
  Traceback (most recent call last):
File "/mnt/hdd/repo/lucene-solr/dev-tools/scripts/checkJavaDocs.py", line 
388, in 
  if checkPackageSummaries(sys.argv[1], level):
File "/mnt/hdd/repo/lucene-solr/dev-tools/scripts/checkJavaDocs.py", line 
351, in checkPackageSummaries
  checkClassSummaries(root)
File "/mnt/hdd/repo/lucene-solr/dev-tools/scripts/checkJavaDocs.py", line 
154, in checkClassSummaries
  f = open(fullPath, encoding='UTF-8')
  FileNotFoundError: [Errno 2] No such file or directory: 
'/mnt/hdd/repo/lucene-solr/solr/server/build/docs/javadoc'
{code}
How can we properly exclude such irregular projects? This workaround works for 
me - does this make sense...?
{code:java}
   @TaskAction
   def lint() {
-dirs.each { dir ->
+//dirs.each { dir ->
+dirs.findAll { project.file(it).exists() }.each { dir ->
   project.logger.info("Checking for missing docs... (dir=${dir}, 
level=${level})")
   checkMissingJavadocs(dir, level)
 }
{code}
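
Or, instead of filtering the directories, maybe it is cleaner to skip the task 
itself when the directory is absent - a sketch of the same idea using onlyIf 
(the task name is taken from the error above):
{code:groovy}
tasks.matching { it.name == "checkMissingDocsDefault" }.configureEach {
  // skip projects (like :solr:server) that produce no javadoc output
  onlyIf { file("${buildDir}/docs/javadoc").exists() }
}
{code}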

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9201-ecj-2.patch, LUCENE-9201-ecj.patch, 
> LUCENE-9201.patch, javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-23 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042900#comment-17042900
 ] 

Tomoko Uchida edited comment on LUCENE-9201 at 2/23/20 11:11 AM:
-

[~dweiss] I did some tests with the patch [^LUCENE-9201.patch] and noticed it 
seems to cause an error with the :solr:server project, where the {{java-library}} 
plugin is applied but there is actually no Java source - hence no Javadoc 
folder.
{code:java}
lucene-solr $ ./gradlew :solr:server:checkMissingDocs
> Task :solr:server:checkMissingDocsDefault FAILED

FAILURE: Build failed with an exception.

* Where:
Script '/mnt/hdd/repo/lucene-solr/gradle/validation/missing-docs-check.gradle' 
line: 105

* What went wrong:
Execution failed for task ':solr:server:checkMissingDocsDefault'.
> Javadoc verification failed:
  Traceback (most recent call last):
File "/mnt/hdd/repo/lucene-solr/dev-tools/scripts/checkJavaDocs.py", line 
388, in 
  if checkPackageSummaries(sys.argv[1], level):
File "/mnt/hdd/repo/lucene-solr/dev-tools/scripts/checkJavaDocs.py", line 
351, in checkPackageSummaries
  checkClassSummaries(root)
File "/mnt/hdd/repo/lucene-solr/dev-tools/scripts/checkJavaDocs.py", line 
154, in checkClassSummaries
  f = open(fullPath, encoding='UTF-8')
  FileNotFoundError: [Errno 2] No such file or directory: 
'/mnt/hdd/repo/lucene-solr/solr/server/build/docs/javadoc'
{code}
How can we properly exclude such irregular projects? This workaround works for 
me - does this make sense...?
{code:java}
   @TaskAction
   def lint() {
-dirs.each { dir ->
+//dirs.each { dir ->
+dirs.findAll { it.exists() }.each { dir ->
   project.logger.info("Checking for missing docs... (dir=${dir}, 
level=${level})")
   checkMissingJavadocs(dir, level)
 }
{code}


was (Author: tomoko uchida):
[~dweiss] I did some tests with the patch [^LUCENE-9201.patch] and noticed it 
seems to cause an error with the :solr:server project, where the {{java-library}} 
plugin is applied but there is actually no Java source - hence no Javadoc 
folder.
{code:java}
lucene-solr $ ./gradlew :solr:server:checkMissingDocs
> Task :solr:server:checkMissingDocsDefault FAILED

FAILURE: Build failed with an exception.

* Where:
Script '/mnt/hdd/repo/lucene-solr/gradle/validation/missing-docs-check.gradle' 
line: 105

* What went wrong:
Execution failed for task ':solr:server:checkMissingDocsDefault'.
> Javadoc verification failed:
  Traceback (most recent call last):
File "/mnt/hdd/repo/lucene-solr/dev-tools/scripts/checkJavaDocs.py", line 
388, in 
  if checkPackageSummaries(sys.argv[1], level):
File "/mnt/hdd/repo/lucene-solr/dev-tools/scripts/checkJavaDocs.py", line 
351, in checkPackageSummaries
  checkClassSummaries(root)
File "/mnt/hdd/repo/lucene-solr/dev-tools/scripts/checkJavaDocs.py", line 
154, in checkClassSummaries
  f = open(fullPath, encoding='UTF-8')
  FileNotFoundError: [Errno 2] No such file or directory: 
'/mnt/hdd/repo/lucene-solr/solr/server/build/docs/javadoc'
{code}
How can we properly exclude such irregular projects? This workaround works for 
me - does this make sense...?
{code:java}
   @TaskAction
   def lint() {
-dirs.each { dir ->
+//dirs.each { dir ->
+dirs.findAll { project.file(it).exists() }.each { dir ->
   project.logger.info("Checking for missing docs... (dir=${dir}, 
level=${level})")
   checkMissingJavadocs(dir, level)
 }
{code}

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9201-ecj-2.patch, LUCENE-9201-ecj.patch, 
> LUCENE-9201.patch, javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9242) Gradle Javadoc task does not include package summaries

2020-02-23 Thread Tomoko Uchida (Jira)
Tomoko Uchida created LUCENE-9242:
-

 Summary: Gradle Javadoc task does not include package summaries
 Key: LUCENE-9242
 URL: https://issues.apache.org/jira/browse/LUCENE-9242
 Project: Lucene - Core
  Issue Type: Sub-task
  Components: general/javadocs
Affects Versions: master (9.0)
Reporter: Tomoko Uchida


"javadoc" task for the Gradle build does not correctly output package 
summaries, since it ignores "package.html" file in the source tree (so the 
Python linter {{checkJavaDocs.py}} detects that and fails for now.)

See for more details: LUCENE-9201
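
One possible direction (an assumption on my side, not a confirmed fix): Gradle's 
Javadoc task passes individual *.java files to the javadoc tool, so the 
package.html files next to them are never seen. Pointing -sourcepath back at the 
raw source directories might restore them:
{code:groovy}
tasks.withType(Javadoc) {
  // let the javadoc tool itself discover package.html next to the sources
  options.addStringOption("sourcepath",
      sourceSets.main.java.srcDirs.join(File.pathSeparator))
}
{code}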



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-23 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043140#comment-17043140
 ] 

Tomoko Uchida commented on LUCENE-9201:
---

I just merged [^LUCENE-9201-missing-docs.patch] into the master. It would be a 
bit hard to maintain the python script (it relies on Javadoc HTML details)... 
but the linter will do its work until it is replaced with LUCENE-9215.
 Also, I opened LUCENE-9242 to fix Javadocs for the gradle build - should we 
call the javadoc tool directly, instead of using Gradle's default task?

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9201-ecj-2.patch, LUCENE-9201-ecj.patch, 
> LUCENE-9201-missing-docs.patch, LUCENE-9201.patch, javadocGRADLE.png, 
> javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9004) Approximate nearest vector search

2020-02-14 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037404#comment-17037404
 ] 

Tomoko Uchida commented on LUCENE-9004:
---

{quote}It can readily be shown that HNSW performs much better in query time. 
But I was surprised that top 1 in-set recall percent of HNSW is so low. It 
shouldn't be a problem of algorithm itself, but more likely a problem of 
implementation or test code. I will check it this weekend.
{quote}
Thanks [~irvingzhang] for measuring. I noticed I might have made a very basic 
mistake when comparing neighborhood nodes; maybe some inequality signs should 
be flipped :/
 I will run recall performance tests with GloVe and fix the bugs.

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to lookup through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However 
> graph-per-segment is very natural at search time - we can traverse each 
> segments' graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: In order to add links bidirectionally we must continually update existing 
> documents.
> I think we want to express this API to users as a single joint 
> {{KnnGraphField}} abstraction that joins together the vectors and the graph 
> as a single joint field type. Mostly it just looks like a vector-valued 
> field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many 
> nocommits, basic design is not really set, there is no Query implementation 
> and no integration with IndexSearcher, but it does work by some measure using 
> a 

[jira] [Comment Edited] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-18 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039137#comment-17039137
 ] 

Tomoko Uchida edited comment on LUCENE-9201 at 2/18/20 2:59 PM:


Yes, the patch works for me too, with this modification. It's not explicitly 
documented, but {{sourceSet.getTaskName("ecjLint", null)}} generates 
"ecjLint" for the main sourceSet and "ecjLintTest" for the test sourceSet (on 
Gradle version 6.0).
{code:java}
 //tasks.create("${sourceSet.name}EcjLint", JavaExec, {
 tasks.create(sourceSet.getTaskName("ecjLint", null), JavaExec, {
{code}
 

Can I ask a few more questions before creating a patch?

1. Even when I commented out those lines in build.gradle, 
":solr:solr-ref-guide:ecjLint" finished without doing anything (linting was 
safely skipped by the changes in solr/solr-ref-guide/build.gradle, if my 
understanding is correct). Is this configuration still needed?
{code:java}
// This excludes solr-ref-guide from the check (excludes are not taken into 
account
// and linting of the ant-based task fails.
configure(project(":solr:solr-ref-guide")) {
  afterEvaluate {
project.tasks.findByPath("mainEcjLint").enabled = false
  }
}
{code}
2. Currently all other check tasks are grouped in "Verification", so would it 
be better to change the task group name "validation" to "Verification"?
{code:java}
$ ./gradlew tasks
...
Validation tasks

ecjLint - Lint Java sources using ECJ.

Verification tasks
--
check - Runs all checks.
checkUnusedConstraints - Ensures all versions in your versions.props correspond 
to an actual gradle dependency
forbiddenApis - Runs forbidden-apis checks.
owasp - Check project dependencies against OWASP vulnerability database.
rat - Runs Apache Rat checks.
test - Runs the unit tests.
verifyLocks - Verifies that your versions.lock is up to date
{code}


was (Author: tomoko uchida):
Yes, the patch works for me too, with this modification. It's not explicitly 
documented, but {{sourceSet.getTaskName("ecjLint", null)}} generates 
"ecjLint" for the main sourceSet and "ecjLintTest" for the test sourceSet (on 
Gradle version 6.0).
{code:java}
 //tasks.create("${sourceSet.name}EcjLint", JavaExec, {
 tasks.create(sourceSet.getTaskName("ecjLint", null), JavaExec, {
{code}
 

Can I ask a few questions before creating a patch?

1. When I commented out those lines in build.gradle, 
":solr:solr-ref-guide:ecjLint" finished without doing anything (linting was 
safely skipped by the changes in solr/solr-ref-guide/build.gradle, if my 
understanding is correct). Is this configuration still needed?
{code:java}
// This excludes solr-ref-guide from the check (excludes are not taken into 
account
// and linting of the ant-based task fails.
configure(project(":solr:solr-ref-guide")) {
  afterEvaluate {
project.tasks.findByPath("mainEcjLint").enabled = false
  }
}
{code}
2. Currently all other check tasks are grouped in "Verification", so would it 
be better to change the task group name "validation" to "Verification"?
{code:java}
$ ./gradlew tasks
...
Validation tasks

ecjLint - Lint Java sources using ECJ.

Verification tasks
--
check - Runs all checks.
checkUnusedConstraints - Ensures all versions in your versions.props correspond 
to an actual gradle dependency
forbiddenApis - Runs forbidden-apis checks.
owasp - Check project dependencies against OWASP vulnerability database.
rat - Runs Apache Rat checks.
test - Runs the unit tests.
verifyLocks - Verifies that your versions.lock is up to date
{code}

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9201-ecj-2.patch, LUCENE-9201-ecj.patch, 
> javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of those two sub targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint"(Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9219) Port ECJ-based linter to gradle

2020-02-18 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-9219:
--
Attachment: LUCENE-9219.patch

> Port ECJ-based linter to gradle
> ---
>
> Key: LUCENE-9219
> URL: https://issues.apache.org/jira/browse/LUCENE-9219
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Priority: Major
> Attachments: LUCENE-9219.patch, LUCENE-9219.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9219) Port ECJ-based linter to gradle

2020-02-18 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039190#comment-17039190
 ] 

Tomoko Uchida commented on LUCENE-9219:
---

Hi [~dweiss],

I attached [^LUCENE-9219.patch], a modified version of the patch in 
LUCENE-9201. Could you take a look at this?
 (diffs from LUCENE-9201-ecj-2.patch)
 - use sourceSet.getTaskName()
 - task group was changed to "Verification"
 - the artifact version of "org.eclipse.jdt:ecj" was moved to "build.gradle" 
from "ecj-lint.gradle"

Also, I excluded the changes to "lucene/common-build.xml", because I thought 
they were added to the patch accidentally. Please let me know if they should be 
included again.
{code:java}
diff --git a/lucene/common-build.xml b/lucene/common-build.xml
index 1e3da88250b..ca5887df550 100644
--- a/lucene/common-build.xml
+++ b/lucene/common-build.xml
@@ -2034,6 +2034,7 @@ 
${ant.project.name}.test.dependencies=${test.classpath.list}
   
 
   
+  
{code}

> Port ECJ-based linter to gradle
> ---
>
> Key: LUCENE-9219
> URL: https://issues.apache.org/jira/browse/LUCENE-9219
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Priority: Major
> Attachments: LUCENE-9219.patch, LUCENE-9219.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9201) Port documentation-lint task to Gradle build

2020-02-18 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039137#comment-17039137
 ] 

Tomoko Uchida commented on LUCENE-9201:
---

Yes, the patch works for me too, with this modification. It's not explicitly 
documented, but {{sourceSet.getTaskName("ecjLint", null)}} generates 
"ecjLintMain" for main sourceSet and "ecjLintTest" for test sourceSet (on 
Gradle version 6.0).
{code:java}
 //tasks.create("${sourceSet.name}EcjLint", JavaExec, {
 tasks.create(sourceSet.getTaskName("ecjLint", null), JavaExec, {
{code}
 

Can I ask a few questions before creating a patch?

1. When I commented out those lines in build.gradle, 
":solr:solr-ref-guide:ecjLint" finished without doing anything (linting was 
safely skipped by the changes in solr/solr-ref-guide/build.gradle, if my 
understanding is correct). Is this configuration still needed?
{code:java}
// This excludes solr-ref-guide from the check (excludes are not taken into account
// and linting of the ant-based task fails).
configure(project(":solr:solr-ref-guide")) {
  afterEvaluate {
project.tasks.findByPath("mainEcjLint").enabled = false
  }
}
{code}
2. Currently all other check tasks are grouped in "Verification", so would it 
be better to change the task group name "validation" to "Verification"?
{code:java}
$ ./gradlew tasks
...
Validation tasks
----------------
ecjLint - Lint Java sources using ECJ.

Verification tasks
------------------
check - Runs all checks.
checkUnusedConstraints - Ensures all versions in your versions.props correspond 
to an actual gradle dependency
forbiddenApis - Runs forbidden-apis checks.
owasp - Check project dependencies against OWASP vulnerability database.
rat - Runs Apache Rat checks.
test - Runs the unit tests.
verifyLocks - Verifies that your versions.lock is up to date
{code}
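
For illustration, a small build-script sketch of that renaming (assuming the ecjLint task names from the patch; this is not part of the patch itself):
{code:groovy}
// Move the ECJ lint tasks from the custom "Validation" group into Gradle's
// conventional "Verification" group, so they list next to check/test/rat.
allprojects {
  tasks.matching { it.name.startsWith('ecjLint') }.configureEach {
    group = 'Verification'
  }
}
{code}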

> Port documentation-lint task to Gradle build
> 
>
> Key: LUCENE-9201
> URL: https://issues.apache.org/jira/browse/LUCENE-9201
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: master (9.0)
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9201-ecj-2.patch, LUCENE-9201-ecj.patch, 
> javadocGRADLE.png, javadocHTML4.png, javadocHTML5.png
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Ant build's "documentation-lint" target consists of these two sub-targets.
>  * "-ecj-javadoc-lint" (Javadoc linting by ECJ)
>  * "-documentation-lint" (Missing javadocs / broken links check by python 
> scripts)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9219) Port ECJ-based linter to gradle

2020-02-18 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-9219:
--
Attachment: LUCENE-9219.patch

> Port ECJ-based linter to gradle
> ---
>
> Key: LUCENE-9219
> URL: https://issues.apache.org/jira/browse/LUCENE-9219
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Priority: Major
> Attachments: LUCENE-9219.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9219) Port ECJ-based linter to gradle

2020-02-18 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida resolved LUCENE-9219.
---
Fix Version/s: master (9.0)
 Assignee: Tomoko Uchida
   Resolution: Fixed

It was merged.

[~dweiss] Many thanks for your kind help.

> Port ECJ-based linter to gradle
> ---
>
> Key: LUCENE-9219
> URL: https://issues.apache.org/jira/browse/LUCENE-9219
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Assignee: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-9219.patch, LUCENE-9219.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9004) Approximate nearest vector search

2020-01-11 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-9004:
--
Description: 
"Semantic" search based on machine-learned vector "embeddings" representing 
terms, queries and documents is becoming a must-have feature for a modern 
search engine. SOLR-12890 is exploring various approaches to this, including 
providing vector-based scoring functions. This is a spinoff issue from that.

The idea here is to explore approximate nearest-neighbor search. Researchers 
have found an approach based on navigating a graph that partially encodes the 
nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
compared to exact nearest neighbor calculations) at a reasonable cost. This 
issue will explore implementing HNSW (hierarchical navigable small-world) 
graphs for the purpose of approximate nearest vector search (often referred to 
as KNN or k-nearest-neighbor search).

At a high level the way this algorithm works is this. First assume you have a 
graph that has a partial encoding of the nearest neighbor relation, with some 
short and some long-distance links. If this graph is built in the right way 
(has the hierarchical navigable small world property), then you can efficiently 
traverse it to find nearest neighbors (approximately) in log N time where N is 
the number of nodes in the graph. I believe this idea was pioneered in  [1]. 
The great insight in that paper is that if you use the graph search algorithm 
to find the K nearest neighbors of a new document while indexing, and then link 
those neighbors (undirectedly, ie both ways) to the new document, then the 
graph that emerges will have the desired properties.

The implementation I propose for Lucene is as follows. We need two new data 
structures to encode the vectors and the graph. We can encode vectors using a 
light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
dimension and have efficient conversion from bytes to floats). For the graph we 
can use {{SortedNumericDocValues}} where the values we encode are the docids of 
the related documents. Encoding the interdocument relations using docids 
directly will make it relatively fast to traverse the graph since we won't need 
to lookup through an id-field indirection. This choice limits us to building a 
graph-per-segment since it would be impractical to maintain a global graph for 
the whole index in the face of segment merges. However graph-per-segment is a 
very natural at search time - we can traverse each segments' graph 
independently and merge results as we do today for term-based search.

At index time, however, merging graphs is somewhat challenging. While indexing 
we build a graph incrementally, performing searches to construct links among 
neighbors. When merging segments we must construct a new graph containing 
elements of all the merged segments. Ideally we would somehow preserve the work 
done when building the initial graphs, but at least as a start I'd propose we 
construct a new graph from scratch when merging. The process is going to be  
limited, at least initially, to graphs that can fit in RAM since we require 
random access to the entire graph while constructing it: In order to add links 
bidirectionally we must continually update existing documents.

I think we want to express this API to users as a single joint 
{{KnnGraphField}} abstraction that joins together the vectors and the graph as 
a single joint field type. Mostly it just looks like a vector-valued field, but 
has this graph attached to it.

I'll push a branch with my POC and would love to hear comments. It has many 
nocommits, basic design is not really set, there is no Query implementation and 
no integration with IndexSearcher, but it does work by some measure using a 
standalone test class. I've tested with uniform random vectors and on my laptop 
indexed 10K documents in around 10 seconds and searched them at 95% recall 
(compared with exact nearest-neighbor baseline) at around 250 QPS. I haven't 
made any attempt to use multithreaded search for this, but it is amenable to 
per-segment concurrency.

[1] 
[https://www.semanticscholar.org/paper/Efficient-and-robust-approximate-nearest-neighbor-Malkov-Yashunin/699a2e3b653c69aff5cf7a9923793b974f8ca164]

 

*UPDATES:*
 * (1/12/2020) The up-to-date branch is: 
[https://gitbox.apache.org/repos/asf?p=lucene-solr.git;a=shortlog;h=refs/heads/jira/lucene-9004-aknn-2]
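
As a rough illustration of the greedy graph descent described above - a sketch only, with hypothetical names, operating on an in-memory adjacency map rather than the per-segment doc-id encoding proposed here:
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.function.IntToDoubleFunction;

final class GreedyKnn {
  /** Approximate top-k neighbors of the query, starting the descent at entryPoint. */
  static List<Integer> search(int entryPoint, int topK,
                              Map<Integer, int[]> neighbors,
                              IntToDoubleFunction distToQuery) {
    Map<Integer, Double> dist = new HashMap<>();
    dist.put(entryPoint, distToQuery.applyAsDouble(entryPoint));
    // frontier: closest open candidate first; results: farthest kept result first
    PriorityQueue<Integer> frontier =
        new PriorityQueue<>(Comparator.comparingDouble(dist::get));
    PriorityQueue<Integer> results =
        new PriorityQueue<>(Comparator.comparingDouble((Integer n) -> dist.get(n)).reversed());
    Set<Integer> visited = new HashSet<>();
    frontier.add(entryPoint);
    results.add(entryPoint);
    visited.add(entryPoint);
    while (!frontier.isEmpty()) {
      int cand = frontier.poll();
      // stop when the closest open candidate cannot improve the current top-k
      if (results.size() >= topK && dist.get(cand) > dist.get(results.peek())) {
        break;
      }
      for (int nb : neighbors.getOrDefault(cand, new int[0])) {
        if (!visited.add(nb)) {
          continue; // already expanded or queued
        }
        dist.put(nb, distToQuery.applyAsDouble(nb));
        if (results.size() < topK || dist.get(nb) < dist.get(results.peek())) {
          frontier.add(nb);
          results.add(nb);
          if (results.size() > topK) {
            results.poll(); // evict the farthest kept result
          }
        }
      }
    }
    List<Integer> out = new ArrayList<>(results);
    out.sort(Comparator.comparingDouble(dist::get));
    return out; // closest first
  }
}
{code}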

  was:
"Semantic" search based on machine-learned vector "embeddings" representing 
terms, queries and documents is becoming a must-have feature for a modern 
search engine. SOLR-12890 is exploring various approaches to this, including 
providing vector-based scoring functions. This is a spinoff issue from that.

The idea here is to explore approximate nearest-neighbor search. Researchers 
have found an approach based on navigating a 

[jira] [Updated] (LUCENE-9004) Approximate nearest vector search

2020-01-11 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-9004:
--
Description: 
"Semantic" search based on machine-learned vector "embeddings" representing 
terms, queries and documents is becoming a must-have feature for a modern 
search engine. SOLR-12890 is exploring various approaches to this, including 
providing vector-based scoring functions. This is a spinoff issue from that.

The idea here is to explore approximate nearest-neighbor search. Researchers 
have found an approach based on navigating a graph that partially encodes the 
nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
compared to exact nearest neighbor calculations) at a reasonable cost. This 
issue will explore implementing HNSW (hierarchical navigable small-world) 
graphs for the purpose of approximate nearest vector search (often referred to 
as KNN or k-nearest-neighbor search).

At a high level the way this algorithm works is this. First assume you have a 
graph that has a partial encoding of the nearest neighbor relation, with some 
short and some long-distance links. If this graph is built in the right way 
(has the hierarchical navigable small world property), then you can efficiently 
traverse it to find nearest neighbors (approximately) in log N time where N is 
the number of nodes in the graph. I believe this idea was pioneered in  [1]. 
The great insight in that paper is that if you use the graph search algorithm 
to find the K nearest neighbors of a new document while indexing, and then link 
those neighbors (undirectedly, ie both ways) to the new document, then the 
graph that emerges will have the desired properties.

The implementation I propose for Lucene is as follows. We need two new data 
structures to encode the vectors and the graph. We can encode vectors using a 
light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
dimension and have efficient conversion from bytes to floats). For the graph we 
can use {{SortedNumericDocValues}} where the values we encode are the docids of 
the related documents. Encoding the interdocument relations using docids 
directly will make it relatively fast to traverse the graph since we won't need 
to lookup through an id-field indirection. This choice limits us to building a 
graph-per-segment since it would be impractical to maintain a global graph for 
the whole index in the face of segment merges. However, graph-per-segment is 
very natural at search time - we can traverse each segment's graph 
independently and merge results as we do today for term-based search.

At index time, however, merging graphs is somewhat challenging. While indexing 
we build a graph incrementally, performing searches to construct links among 
neighbors. When merging segments we must construct a new graph containing 
elements of all the merged segments. Ideally we would somehow preserve the work 
done when building the initial graphs, but at least as a start I'd propose we 
construct a new graph from scratch when merging. The process is going to be  
limited, at least initially, to graphs that can fit in RAM since we require 
random access to the entire graph while constructing it: In order to add links 
bidirectionally we must continually update existing documents.

I think we want to express this API to users as a single joint 
{{KnnGraphField}} abstraction that joins together the vectors and the graph as 
a single joint field type. Mostly it just looks like a vector-valued field, but 
has this graph attached to it.

I'll push a branch with my POC and would love to hear comments. It has many 
nocommits, basic design is not really set, there is no Query implementation and 
no integration with IndexSearcher, but it does work by some measure using a 
standalone test class. I've tested with uniform random vectors and on my laptop 
indexed 10K documents in around 10 seconds and searched them at 95% recall 
(compared with exact nearest-neighbor baseline) at around 250 QPS. I haven't 
made any attempt to use multithreaded search for this, but it is amenable to 
per-segment concurrency.

[1] 
[https://www.semanticscholar.org/paper/Efficient-and-robust-approximate-nearest-neighbor-Malkov-Yashunin/699a2e3b653c69aff5cf7a9923793b974f8ca164]

 

*UPDATES:*
 * (1/12/2020) The up-to-date branch is: 
[https://github.com/apache/lucene-solr/tree/jira/lucene-9004-aknn-2]

  was:
"Semantic" search based on machine-learned vector "embeddings" representing 
terms, queries and documents is becoming a must-have feature for a modern 
search engine. SOLR-12890 is exploring various approaches to this, including 
providing vector-based scoring functions. This is a spinoff issue from that.

The idea here is to explore approximate nearest-neighbor search. Researchers 
have found an approach based on navigating a graph that partially encodes the 

[jira] [Commented] (LUCENE-9004) Approximate nearest vector search

2020-01-11 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013667#comment-17013667
 ] 

Tomoko Uchida commented on LUCENE-9004:
---

[~sokolov] thanks, I myself have also tested it with a real dataset 
generated from recent snapshot files of Japanese Wikipedia. Yes, it seems 
"functionally correct", although we should do more formal tests for measuring 
recall (effectiveness). 
{quote}I think it's time to post back to a branch in the Apache git repository 
so we can enlist contributions from the community here to help this go forward. 
I'll try to get that done this weekend
{quote}
OK, I pushed the branch to the Apache Gitbox to let others who want to get involved 
in this issue check it out and have a try. 
 
[https://gitbox.apache.org/repos/asf?p=lucene-solr.git;a=shortlog;h=refs/heads/jira/lucene-9004-aknn-2]
 This also includes a patch from Xin-Chun Zhang. 
 Note: currently the new codec for the vectors and kNN graphs is placed in 
{{o.a.l.codecs.lucene90}}; I think we can move it to the proper location when 
this is ready to be released.

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to lookup through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However, 
> graph-per-segment is very natural at search time - we can traverse each 
> segment's graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be  limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: In order to add links bidirectionally we must continually update existing 
> documents.
> I think we want to express this API to users as a single joint 
> {{KnnGraphField}} abstraction that joins together the vectors and the graph 
> as a single joint field 

[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search

2020-01-11 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013667#comment-17013667
 ] 

Tomoko Uchida edited comment on LUCENE-9004 at 1/12/20 7:51 AM:


[~sokolov] thanks, I myself have also tested it with a real dataset 
generated from recent snapshot files of Japanese Wikipedia. Yes, it seems 
"functionally correct", although we should do more formal tests for measuring 
recall (effectiveness).
{quote}I think it's time to post back to a branch in the Apache git repository 
so we can enlist contributions from the community here to help this go forward. 
I'll try to get that done this weekend
{quote}
OK, I pushed the branch to the Apache Gitbox to let others who want to get involved 
in this issue check it out and have a try. While I feel it's far from being 
complete :), I agree that the code is ready to take in contributions 
from the community.
 
[https://gitbox.apache.org/repos/asf?p=lucene-solr.git;a=shortlog;h=refs/heads/jira/lucene-9004-aknn-2]
 This also includes a patch from Xin-Chun Zhang. 
 Note: currently the new codec for the vectors and kNN graphs is placed in 
{{o.a.l.codecs.lucene90}}; I think we can move it to the proper location when 
this is ready to be released.


was (Author: tomoko uchida):
[~sokolov] thanks, I myself also have tested it with a real dataset that is 
generated from recent snapshot files of Japanese Wikipedia. Yes it seems like 
"functionally correct", although we should do more formal tests for measuring 
Recall (effectiveness). 
{quote}I think it's time to post back to a branch in the Apache git repository 
so we can enlist contributions from the community here to help this go forward. 
I'll try to get that done this weekend
{quote}
OK, I pushed the branch to the Apache Gitbox to let others who want to involve 
in this issue check out it and have a try. 
 
[https://gitbox.apache.org/repos/asf?p=lucene-solr.git;a=shortlog;h=refs/heads/jira/lucene-9004-aknn-2]
 This also includes a patch Xin-Chun Zhang. 
 Note: currently the new codec for the vectors and kNN graphs is placed in 
{{o.a.l.codecs.lucene90}}, I think we can move this to proper location when 
this is ready to be released.

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to lookup through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical 

[jira] [Commented] (LUCENE-9129) Updating from 7.X to 8.X breaks

2020-01-12 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014043#comment-17014043
 ] 

Tomoko Uchida commented on LUCENE-9129:
---

It's not a bug but an intended change related to this optimization. 
 https://issues.apache.org/jira/browse/LUCENE-4100

The issue isn't mentioned in the "API Changes" section but in the "Optimizations" section of 
the change log: 
[https://lucene.apache.org/core/8_0_0/changes/Changes.html|https://lucene.apache.org/core/8_4_0/changes/Changes.html].
 Although it includes API changes, I think the location is appropriate for the 
main purpose of the issue.

In short, you need to implement {{Collector#scoreMode()}} and also discard 
{{#needsScores()}} when upgrading to Lucene 8.0+, as the log message says. 
 See: 
[https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/Collector.html]

Since there were many changes/optimizations between 7.x and 8.x, it's hard to 
make an exhaustive list of every breaking change in the APIs (in addition, the 
{{Collector}} interface is marked as "Expert:", which means it is for expert 
users who are familiar with Lucene internals). So please refer to the Javadocs 
when you encounter errors related to the library version upgrade. The Git 
commit log and diff commands will also help in getting more detailed 
information.
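
For illustration, a minimal sketch (simplified names, not from the issue) of the anonymous collector from the quoted snippet below, migrated to the 8.x API with {{scoreMode()}} in place of the removed {{needsScores()}}:
{code:java}
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;

public class MigratedCollectorDemo {
  public static SimpleCollector newCollector() {
    return new SimpleCollector() {
      @Override
      public ScoreMode scoreMode() {
        // Replaces the old "public boolean needsScores() { return false; }"
        return ScoreMode.COMPLETE_NO_SCORES;
      }

      @Override
      protected void doSetNextReader(LeafReaderContext ctx) throws IOException {
        // per-segment setup, unchanged from the 7.x version
      }

      @Override
      public void collect(int doc) throws IOException {
        // per-document logic, unchanged from the 7.x version
      }
    };
  }
}
{code}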

 

 

> Updating from 7.X to 8.X breaks
> ---
>
> Key: LUCENE-9129
> URL: https://issues.apache.org/jira/browse/LUCENE-9129
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: xia0c
>Priority: Major
>
> Hi, during my upgrading process from 7.X to 8.X I found another code break. 
> {code:java}
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.index.LeafReaderContext;
> import org.apache.lucene.index.MultiDocValues;
> import org.apache.lucene.index.SortedDocValues;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.SimpleCollector;
> import org.apache.lucene.search.TopDocs;
> import org.apache.lucene.util.LongValues;
> import org.apache.solr.handler.component.FacetComponent.FacetContext;
> import org.apache.solr.search.DocSet;
> import org.apache.solr.search.DocSetUtil;
> import org.apache.lucene.index.IndexReader;
> public class TestLucene {
>   
>   private FacetContext fcontext;
>   private DocSet docs;
>   private IndexReader reader;
>   
>   
>   public void demo() throws  IOException {
>   
> DocSetUtil.collectSortedDocSet(docs, reader, new 
> SimpleCollector() {
> @Override 
> public boolean needsScores() { return false; }
> @Override
> protected void doSetNextReader(LeafReaderContext ctx) throws 
> IOException {
> // TODO
> }
>   @Override
>   public void collect(int doc) throws IOException {
>   // TODO Auto-generated method stub
>   
>   }
>   });
>   
>   }
> }
> {code}
> The code should pass before, but it throws an error:
> {code:java}
> [ERROR] /TestLucene.java:[32,82]  is not abstract and 
> does not override abstract method scoreMode() in 
> org.apache.lucene.search.Collector
> [ERROR] /TestLucene.java:[36,19] method does not override or implement a 
> method from a supertype
> {code}
> I try to find changes in the migration 
> guide(https://github.com/apache/lucene-solr/blob/branch_8x/lucene/MIGRATE.txt)
>  but I didn't find it. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-9129) Updating from 7.X to 8.X breaks

2020-01-12 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014043#comment-17014043
 ] 

Tomoko Uchida edited comment on LUCENE-9129 at 1/13/20 6:13 AM:


It's not a bug but an intended change related to this optimization. 
 https://issues.apache.org/jira/browse/LUCENE-4100

The issue isn't mentioned in the "API Changes" section but in the "Optimizations" section of 
the change log: 
[https://lucene.apache.org/core/8_0_0/changes/Changes.html|https://lucene.apache.org/core/8_0_0/changes/Changes.html].
 Although it includes API changes, I think the location is appropriate for the 
main purpose of the issue.

In short, you need to implement {{Collector#scoreMode()}} and also discard 
{{#needsScores()}} when upgrading to Lucene 8.0+, as the log message says. 
 See: 
[https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/Collector.html]

Since there were many changes/optimizations between 7.x and 8.x, it's hard to 
make an exhaustive list of every breaking change in the APIs (in addition, the 
{{Collector}} interface is marked as "Expert:", which means it is for expert 
users who are familiar with Lucene internals). So please refer to the Javadocs 
when you encounter errors related to the library version upgrade. The Git 
commit log and diff commands will also help in getting more detailed 
information.

 

 


was (Author: tomoko uchida):
It's not a bug but an intended change related to this optimization. 
 https://issues.apache.org/jira/browse/LUCENE-4100

The issue isn't mentioned in "API Changes" but "Optimizations" section on the 
Change log: 
[https://lucene.apache.org/core/8_0_0/changes/Changes.html|https://lucene.apache.org/core/8_4_0/changes/Changes.html].
 Although it includes API changes, I think the location is appropriate for the 
main purpose of the issue.

In short, you need to implement {{Collector#scoreMode()}} and also discard 
{{#needsScores()}} when upgrading to Lucene 8.0+ as the log message says. 
 See: 
[https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/Collector.html]

Since there were many changes/optimizations between 7.x and 8.x, it's hard to 
make an exhaustive list for every breaking change in the APIs (in addition, the 
{{Collector}} interface is marked as "Expert:", that means this is for expert 
users who are familiar with Lucene internals). So could you refer the Javadocs 
when you encounter errors relating to the library version upgrade. The Git 
commit log and diff command will also be help for getting more detailed 
information.

 

 

> Updating from 7.X to 8.X breaks
> ---
>
> Key: LUCENE-9129
> URL: https://issues.apache.org/jira/browse/LUCENE-9129
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: xia0c
>Priority: Major
>
> Hi, during my upgrading process from 7.X to 8.X I found another code break. 
> {code:java}
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.index.LeafReaderContext;
> import org.apache.lucene.index.MultiDocValues;
> import org.apache.lucene.index.SortedDocValues;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.SimpleCollector;
> import org.apache.lucene.search.TopDocs;
> import org.apache.lucene.util.LongValues;
> import org.apache.solr.handler.component.FacetComponent.FacetContext;
> import org.apache.solr.search.DocSet;
> import org.apache.solr.search.DocSetUtil;
> import org.apache.lucene.index.IndexReader;
> public class TestLucene {
>   
>   private FacetContext fcontext;
>   private DocSet docs;
>   private IndexReader reader;
>   
>   
>   public void demo() throws  IOException {
>   
> DocSetUtil.collectSortedDocSet(docs, reader, new 
> SimpleCollector() {
> @Override 
> public boolean needsScores() { return false; }
> @Override
> protected void doSetNextReader(LeafReaderContext ctx) throws 
> IOException {
> // TODO
> }
>   @Override
>   public void collect(int doc) throws IOException {
>   // TODO Auto-generated method stub
>   
>   }
>   });
>   
>   }
> }
> {code}
> The code should pass before, but it throws an error:
> {code:java}
> [ERROR] /TestLucene.java:[32,82]  is not abstract and 
> does not override abstract method scoreMode() in 
> org.apache.lucene.search.Collector
> [ERROR] /TestLucene.java:[36,19] method does not override or implement a 
> method from a supertype
> {code}
> I try to find changes in the migration 
> 

[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-12 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014065#comment-17014065
 ] 

Tomoko Uchida commented on LUCENE-9123:
---

Hi [~h.kazuaki],
 introducing the option {{discardCompoundToken}} looks fine to me; however, I 
think we shouldn't change the signatures of the existing constructors, for backwards 
compatibility (they are public interfaces, so we have to keep them during 8.x 
anyway). Instead, we can add a new constructor. Opinions?
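
For example, a sketch of that back-compat pattern (a simplified, hypothetical class - not the actual JapaneseTokenizer signatures):
{code:java}
public class MyTokenizer {
  private final boolean discardPunctuation;
  private final boolean discardCompoundToken;

  // Existing public constructor: signature unchanged, old behavior preserved.
  public MyTokenizer(boolean discardPunctuation) {
    this(discardPunctuation, false); // compound tokens are still emitted
  }

  // New constructor exposing the proposed discardCompoundToken option.
  public MyTokenizer(boolean discardPunctuation, boolean discardCompoundToken) {
    this.discardPunctuation = discardPunctuation;
    this.discardCompoundToken = discardCompoundToken;
  }
}
{code}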

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.4
>Reporter: Kazuaki Hiraga
>Priority: Major
> Attachments: LUCENE-9123.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple 
> tokens as an output. If we use `mode=normal`, it should be fine. However, we 
> would like to use decomposed tokens that can maximize the chance to increase 
> recall.
> Snippet of schema:
> {code:xml}
>  positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   
> 
>  synonyms="lang/synonyms_ja.txt"
> tokenizerFactory="solr.JapaneseTokenizerFactory"/>
> 
> 
>  tags="lang/stoptags_ja.txt" />
> 
> 
> 
> 
> 
>  minimumLength="4"/>
> 
> 
>   
> 
> {code}
> A synonym entry that generates the error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is an output on console:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-17 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018278#comment-17018278
 ] 

Tomoko Uchida commented on LUCENE-9123:
---

I thought the change in the behavior would have very small or no impact on users who 
use the Tokenizer for searching, but yes, it would affect users who use it for 
pure tokenization purposes.

While keeping backward compatibility (within the same major version) is 
important, not emitting compound tokens would be preferred so as to get along with 
succeeding token filters, and compound tokens are not needed for most use cases. 
I think it'd be better that we change the behavior at some point.

How about this proposal: we can create two patches, one for master and one 
for 8x. On the 8x branch, add the new constructor so you can use it from the next 
update; there is no change in the default behavior. On the master branch, 
switch the default behavior (users who don't like the change can still switch 
back by using the full constructor).

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.4
>Reporter: Kazuaki Hiraga
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple 
> tokens as an output. If we use `mode=normal`, it should be fine. However, we 
> would like to use decomposed tokens that can maximize the chance to increase 
> recall.
> Snippet of schema:
> {code:xml}
>  positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   
> 
>  synonyms="lang/synonyms_ja.txt"
> tokenizerFactory="solr.JapaneseTokenizerFactory"/>
> 
> 
>  tags="lang/stoptags_ja.txt" />
> 
> 
> 
> 
> 
>  minimumLength="4"/>
> 
> 
>   
> 
> {code}
> A synonym entry that generates the error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is an output on console:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-17 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018504#comment-17018504
 ] 

Tomoko Uchida commented on LUCENE-9123:
---

{quote}OK. I will prepare another patch for the master branch.
{quote}
Thanks [~h.kazuaki]! Once the work is done I can create the patch for the 8x branch 
by applying some modifications to yours, if arranging two 
patches feels like a bother.
 Also, we need to add some tests to {{TestJapaneseTokenizer}} and 
{{TestJapaneseTokenizerFactory}}. And by convention, the final patch 
for the master branch should be named "LUCENE-9123.patch", so can you please 
overwrite the obsolete patch instead of uploading new ones?
{quote}Then, a person who is a maintainer of Japanese Tokenizer can choose how 
to merge the changes (who is responsible for Japanese Tokenizer for now?)
{quote}
I'm not sure if there is an explicit maintainer for each Lucene module; 
theoretically, every person who has write access to the ASF repo can commit any 
patch on their own responsibility.
 Let us wait for a few days and I will commit the patch if there are no other 
comments or objections.
 [~cm] do you have any feedback about this change?

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.4
>Reporter: Kazuaki Hiraga
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple 
> tokens as an output. If we use `mode=normal`, it should be fine. However, we 
> would like to use decomposed tokens that can maximize the chance to increase 
> recall.
> Snippet of schema:
> {code:xml}
>  positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   
> 
>  synonyms="lang/synonyms_ja.txt"
> tokenizerFactory="solr.JapaneseTokenizerFactory"/>
> 
> 
>  tags="lang/stoptags_ja.txt" />
> 
> 
> 
> 
> 
>  minimumLength="4"/>
> 
> 
>   
> 
> {code}
> A synonym entry that generates the error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is an output on console:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9004) Approximate nearest vector search

2020-01-17 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018518#comment-17018518
 ] 

Tomoko Uchida commented on LUCENE-9004:
---

Hi [~jtibshirani], 
 thanks for your comments/suggestions. I will check the links you mentioned.
{quote}It also suggests that graph-based kNN is an active research area and 
that there are likely to be improvements + new approaches that come out.
{quote}
Yes, there are so many proposed methods and their variants in this field. 
Currently I'm not fully sure what the most feasible approach for Lucene is.

Also, I just noticed an issue that proposes a product-quantization-based 
approach - roughly speaking, it may need less disk and memory space than 
graph-based methods like HNSW but incurs higher indexing and query-time costs: 
https://issues.apache.org/jira/browse/LUCENE-9136
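
As a rough, illustrative calculation (not taken from that issue): a 128-dimensional float vector occupies 128 × 4 = 512 bytes, while a typical product-quantization code of 16 subvectors × 8 bits stores the same vector in 16 bytes - a 32x reduction - paid for with codebook training at index time and per-distance table lookups at query time.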

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to lookup through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However 
> graph-per-segment is a very natural at search time - we can traverse each 
> segments' graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be  limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: In order to add links bidirectionally we must continually update existing 
> documents.
> I think we want to express this API to users as a single joint 
> {{KnnGraphField}} abstraction that joins together the vectors and the graph 
> as a single joint field type. Mostly it just looks like a vector-valued 
> field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many 
> nocommits, 

[jira] [Assigned] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-17 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida reassigned LUCENE-9123:
-

Assignee: Tomoko Uchida

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.4
>Reporter: Kazuaki Hiraga
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple 
> tokens as an output. If we use `mode=normal`, it should be fine. However, we 
> would like to use decomposed tokens that can maximize the chance to increase 
> recall.
> Snippet of schema:
> {code:xml}
>  positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   
> 
>  synonyms="lang/synonyms_ja.txt"
> tokenizerFactory="solr.JapaneseTokenizerFactory"/>
> 
> 
>  tags="lang/stoptags_ja.txt" />
> 
> 
> 
> 
> 
>  minimumLength="4"/>
> 
> 
>   
> 
> {code}
> A synonym entry that generates the error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is an output on console:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9129) Updating from 7.X to 8.X breaks

2020-01-13 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida resolved LUCENE-9129.
---
Resolution: Not A Problem

> Updating from 7.X to 8.X breaks
> ---
>
> Key: LUCENE-9129
> URL: https://issues.apache.org/jira/browse/LUCENE-9129
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: xia0c
>Priority: Major
>
> Hi, during my upgrading process from 7.X to 8.X I found another code break. 
> {code:java}
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.index.LeafReaderContext;
> import org.apache.lucene.index.MultiDocValues;
> import org.apache.lucene.index.SortedDocValues;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.SimpleCollector;
> import org.apache.lucene.search.TopDocs;
> import org.apache.lucene.util.LongValues;
> import org.apache.solr.handler.component.FacetComponent.FacetContext;
> import org.apache.solr.search.DocSet;
> import org.apache.solr.search.DocSetUtil;
> import org.apache.lucene.index.IndexReader;
> public class TestLucene {
>   
>   private FacetContext fcontext;
>   private DocSet docs;
>   private IndexReader reader;
>   
>   
>   public void demo() throws  IOException {
>   
> DocSetUtil.collectSortedDocSet(docs, reader, new 
> SimpleCollector() {
> @Override 
> public boolean needsScores() { return false; }
> @Override
> protected void doSetNextReader(LeafReaderContext ctx) throws 
> IOException {
> // TODO
> }
>   @Override
>   public void collect(int doc) throws IOException {
>   // TODO Auto-generated method stub
>   
>   }
>   });
>   
>   }
> }
> {code}
> The code should pass before, but it throws an error:
> {code:java}
> [ERROR] /TestLucene.java:[32,82]  is not abstract and 
> does not override abstract method scoreMode() in 
> org.apache.lucene.search.Collector
> [ERROR] /TestLucene.java:[36,19] method does not override or implement a 
> method from a supertype
> {code}
> I try to find changes in the migration 
> guide(https://github.com/apache/lucene-solr/blob/branch_8x/lucene/MIGRATE.txt)
>  but I didn't find it. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-17 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018187#comment-17018187
 ] 

Tomoko Uchida commented on LUCENE-9123:
---

{quote}
However, I don't think there are many situations that we need original tokens 
along with decompound ones
{quote}

Personally I agree with that. Concerning full-text searching, we rarely need 
original tokens when we use the "search mode". Why don't we set 
"discardCompoundToken" to true by default from here (I think this minor change 
in behaviour is okay for the next 8.x release)?

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.4
>Reporter: Kazuaki Hiraga
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple 
> tokens as an output. If we use `mode=normal`, it should be fine. However, we 
> would like to use decomposed tokens that can maximize the chance to increase 
> recall.
> Snippet of schema:
> {code:xml}
>  positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   
> 
>  synonyms="lang/synonyms_ja.txt"
> tokenizerFactory="solr.JapaneseTokenizerFactory"/>
> 
> 
>  tags="lang/stoptags_ja.txt" />
> 
> 
> 
> 
> 
>  minimumLength="4"/>
> 
> 
>   
> 
> {code}
> A synonym entry that generates the error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is an output on console:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-17 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018187#comment-17018187
 ] 

Tomoko Uchida edited comment on LUCENE-9123 at 1/17/20 4:59 PM:


{quote}
However, I don't think there are many situations that we need original tokens 
along with decompound ones
{quote}

Personally I agree with that. Concerning full-text searching, we rarely need 
original tokens when we use the "search mode". Why don't we set 
"discardCompoundToken" to true by default from here (I think this minor change 
in behaviour is okay for the next 8.x release)?


was (Author: tomoko uchida):
{{quote}}
However, I don't think there are many situations that we need original tokens 
along with decompound ones
{{quote}}

Personally I agree with that. Concerning full text searching, we rarely need 
original tokens when we use the "search mode". Why don't we set 
"discardCompoundToken" to true by default from here (I think this minor change 
in behaviour is Okay for next 8.x release)?

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.4
>Reporter: Kazuaki Hiraga
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple 
> tokens as an output. If we use `mode=normal`, it should be fine. However, we 
> would like to use decomposed tokens that can maximize the chance to increase 
> recall.
> Snippet of schema:
> {code:xml}
>  positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   
> 
>  synonyms="lang/synonyms_ja.txt"
> tokenizerFactory="solr.JapaneseTokenizerFactory"/>
> 
> 
>  tags="lang/stoptags_ja.txt" />
> 
> 
> 
> 
> 
>  minimumLength="4"/>
> 
> 
>   
> 
> {code}
> A synonym entry that generates the error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is an output on console:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9004) Approximate nearest vector search

2020-01-10 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012942#comment-17012942
 ] 

Tomoko Uchida commented on LUCENE-9004:
---

Hi,
 it seems that some devs are strongly interested in this issue and I privately 
have received feedback (and expectations). So I just wanted to share my latest 
WIP branch.
 
[https://github.com/mocobeta/lucene-solr-mirror/commits/jira/LUCENE-9004-aknn-2]
 And a usage code snippet for that is: 
[https://gist.github.com/mocobeta/a5b18506ebc933c0afa7ab61d1dd2295]

I introduced a brand new codec and indexer for vector search, so this no longer 
depends on DocValues, though it's still at a pretty early stage (in particular, 
segment merging is not yet implemented).


 I intend to continue working on it and I'll do my best, but to be honest I am not 
sure if my approach is the best - or whether I can create a great patch that can be 
merged into Lucene core... I welcome someone taking it over in some 
different, more sophisticated/efficient way. My current attempt might be 
useful as a reference or a starting point.
  

 

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level, the algorithm works as follows. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to lookup through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However 
> graph-per-segment is very natural at search time - we can traverse each 
> segment's graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be  limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: In order to add links bidirectionally we must continually update existing 
> documents.
> I think we want to express this API to users as a single 
> {{KnnGraphField}} abstraction that joins together the vectors and the graph 
> in a single field type.
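
As an aside, the vector encoding the quoted proposal describes (a light 
wrapper around {{BinaryDocValues}} with efficient byte-to-float conversion) 
could look roughly like this; the class and method names are illustrative 
only, not an existing Lucene API:

{code:java}
import java.nio.ByteBuffer;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.util.BytesRef;

/** Sketch: pack per-document vectors into a BinaryDocValues field and unpack
 *  them again for distance computations. The proposed KnnGraphField would
 *  hide this (plus the graph) behind a single field type. */
public final class VectorDocValues {

  /** Encode a float vector as the raw bytes of a BinaryDocValues field. */
  public static BinaryDocValuesField encode(String field, float[] vector) {
    ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES);
    buf.asFloatBuffer().put(vector);
    return new BinaryDocValuesField(field, new BytesRef(buf.array()));
  }

  /** Decode the stored bytes back into a float vector of known dimension. */
  public static float[] decode(BytesRef bytes, int dimension) {
    float[] vector = new float[dimension];
    ByteBuffer.wrap(bytes.bytes, bytes.offset, bytes.length)
        .asFloatBuffer()
        .get(vector);
    return vector;
  }
}
{code}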

[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search

2020-01-10 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012942#comment-17012942
 ] 

Tomoko Uchida edited comment on LUCENE-9004 at 1/10/20 3:02 PM:


Hi,
it seems that some devs are strongly interested in this issue, and I have 
privately received feedback (and expectations). So I just wanted to share my 
latest WIP branch:
[https://github.com/mocobeta/lucene-solr-mirror/commits/jira/LUCENE-9004-aknn-2]
And here is a usage code snippet for it: 
[https://gist.github.com/mocobeta/a5b18506ebc933c0afa7ab61d1dd2295]

I introduced a brand new codec and indexer for vector search, so this no 
longer depends on DocValues, though it's still at a pretty early stage (in 
particular, segment merging is not yet implemented).

I intend to continue working on this and will do my best, but to be honest I 
am not sure my approach is the best one, or that I can produce a patch good 
enough to be merged into Lucene core... I would welcome someone taking this 
over in a different, more sophisticated/efficient way. My current attempt 
might be useful as a reference or a starting point.
  

 


was (Author: tomoko uchida):
Hi,
 it seems that some devs are strongly interested in this issue and I privately 
have received feedback (and expectations). So I just wanted to share my latest 
WIP branch.
 
[https://github.com/mocobeta/lucene-solr-mirror/commits/jira/LUCENE-9004-aknn-2]
 And an usage code snippet for that is: 
[https://gist.github.com/mocobeta/a5b18506ebc933c0afa7ab61d1dd2295]

I introduced a brand new codecs and indexer for vector search so this no longer 
depends on DocValues, though it's still on pretty early stage (especially, 
segment merging is not yet implemented).


 I intend to continue to work and I'll do my best, but to be honest I am not 
sure if my approach is the best - or I can create a great patch that can be 
merged to Lucene core... I welcome that someone takes over it in some 
different, more sophisticated/efficient ways. My current attempt might be 
useful as a reference or the starting point.
  

 

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level, the algorithm works as follows. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to lookup through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However 
> graph-per-segment is very natural at search time - we can traverse each 
> segment's graph independently and merge results as we do today for term-based 
> search.
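
To make the traversal concrete, here is a compact single-layer sketch of the 
greedy search the quoted description refers to (the hierarchy, entry-point 
selection, and any Lucene integration are omitted; the array-based graph 
representation and L2 distance are assumptions for illustration only):

{code:java}
import java.util.Comparator;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

/** Single-layer greedy k-NN search over an adjacency-list graph. */
public final class GreedyKnnSearch {

  /** vectors[i] is node i's embedding; adjacency[i] holds its linked nodes. */
  public static int[] search(int entryPoint, int k, float[] query,
                             float[][] vectors, int[][] adjacency) {
    Comparator<Integer> byDistance =
        Comparator.comparingDouble(id -> l2(query, vectors[id]));
    PriorityQueue<Integer> candidates = new PriorityQueue<>(byDistance); // nearest first
    PriorityQueue<Integer> results =
        new PriorityQueue<>(byDistance.reversed()); // farthest of the top-k first
    Set<Integer> visited = new HashSet<>();
    candidates.add(entryPoint);
    results.add(entryPoint);
    visited.add(entryPoint);
    while (!candidates.isEmpty()) {
      int current = candidates.poll();
      // Stop once the nearest open candidate is farther than the worst result.
      if (results.size() >= k
          && l2(query, vectors[current]) > l2(query, vectors[results.peek()])) {
        break;
      }
      for (int neighbor : adjacency[current]) {
        if (visited.add(neighbor)) { // true if not seen before
          candidates.add(neighbor);
          results.add(neighbor);
          if (results.size() > k) {
            results.poll(); // keep only the k closest
          }
        }
      }
    }
    return results.stream().mapToInt(Integer::intValue).toArray();
  }

  private static float l2(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
{code}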

[jira] [Commented] (LUCENE-9119) Support (re-implement) "reconstruct" feature

2020-01-07 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009764#comment-17009764
 ] 

Tomoko Uchida commented on LUCENE-9119:
---

The main reason why they need "reconstruct" is:

> Seeing the terms together in the order they went in, with position gaps, 
> helps us understand why (say) a phrase query which spans multiple adjacent 
> terms didn't match.

 

> Support (re-implement) "reconstruct" feature
> 
>
> Key: LUCENE-9119
> URL: https://issues.apache.org/jira/browse/LUCENE-9119
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/luke
>Reporter: Tomoko Uchida
>Priority: Major
>
> I have dropped the "reconstruct" feature from the Documents tab when porting 
> from Thinlet to Swing. However, there is a strong request for supporting this 
> feature.
> [https://github.com/DmitryKey/luke/pull/177]
> I think it's easily possible and not harmful to just restore terms (with 
> their positions) and show them in a popup for a field.
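
For reference, the core of such a reconstruction can be sketched on top of the 
existing term vectors API (this assumes the field was indexed with term 
vectors including positions; the class and method names are illustrative):

{code:java}
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

/** Sketch: rebuild the indexed token sequence of one document's field from
 *  its term vector. Missing keys in the returned map correspond to position
 *  gaps; tokens sharing a position overwrite each other in this simple map. */
public final class FieldReconstructor {

  public static Map<Integer, String> reconstruct(
      IndexReader reader, int docId, String field) throws IOException {
    Map<Integer, String> tokensByPosition = new TreeMap<>(); // sorted by position
    Terms termVector = reader.getTermVector(docId, field);
    if (termVector == null) {
      return tokensByPosition; // no term vector stored for this field
    }
    TermsEnum termsEnum = termVector.iterator();
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
      PostingsEnum postings = termsEnum.postings(null, PostingsEnum.POSITIONS);
      while (postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        for (int i = 0; i < postings.freq(); i++) {
          tokensByPosition.put(postings.nextPosition(), term.utf8ToString());
        }
      }
    }
    return tokensByPosition;
  }
}
{code}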



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org


