[jira] [Comment Edited] (LUCENE-10397) KnnVectorQuery doesn't tie break by doc ID

2022-07-13 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566315#comment-17566315
 ] 

Alessandro Benedetti edited comment on LUCENE-10397 at 7/13/22 1:15 PM:


Hi Lu,
I was not aware of this Jira issue.
Yes, this should have been resolved by my contribution.
Cheers


was (Author: alessandro.benedetti):
Hi Lu,
I was not aware of this issue.
Yes, this should have been resolved by my contribution.
Cheers

> KnnVectorQuery doesn't tie break by doc ID
> --
>
> Key: LUCENE-10397
> URL: https://issues.apache.org/jira/browse/LUCENE-10397
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> I was expecting KnnVectorQuery to tie-break by doc ID so that if multiple 
> documents get the same score, the ones with the lowest doc IDs would be 
> returned first, similarly to how SortField.SCORE also tie-breaks by doc 
> ID.
> However, the following test fails, suggesting that this is not the case.
> {code:java}
>   public void testTieBreak() throws IOException {
> try (Directory d = newDirectory()) {
>   try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) {
> for (int j = 0; j < 5; j++) {
>   Document doc = new Document();
>   doc.add(
>   new KnnVectorField("field", new float[] {0, 1}, VectorSimilarityFunction.DOT_PRODUCT));
>   w.addDocument(doc);
> }
>   }
>   try (IndexReader reader = DirectoryReader.open(d)) {
> assertEquals(1, reader.leaves().size());
> IndexSearcher searcher = new IndexSearcher(reader);
> KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, 3}, 3);
> TopDocs topHits = searcher.search(query, 3);
> assertEquals(0, topHits.scoreDocs[0].doc);
> assertEquals(1, topHits.scoreDocs[1].doc);
> assertEquals(2, topHits.scoreDocs[2].doc);
>   }
> }
>   }
> {code}
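For readers following along, the expected ordering in the test above is "score descending, doc ID ascending". A minimal, self-contained sketch of that tie-break rule (using a stand-in `Hit` record, not Lucene's ScoreDoc API):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TieBreakSketch {
    // Minimal stand-in for Lucene's ScoreDoc: a doc ID plus a score.
    record Hit(int doc, float score) {}

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>(List.of(
                new Hit(3, 1.0f), new Hit(1, 1.0f), new Hit(4, 0.5f), new Hit(0, 1.0f)));
        // Order by score descending, then by doc ID ascending to break ties.
        hits.sort(Comparator.comparingDouble((Hit h) -> -h.score())
                .thenComparingInt(Hit::doc));
        for (Hit h : hits) {
            System.out.println(h.doc() + " " + h.score());
        }
    }
}
```

With the tie-break applied, the three documents scoring 1.0 come back in doc-ID order (0, 1, 3), which is what the failing test expects from KnnVectorQuery.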



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10397) KnnVectorQuery doesn't tie break by doc ID

2022-07-13 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566315#comment-17566315
 ] 

Alessandro Benedetti commented on LUCENE-10397:
---

Hi Lu,
I was not aware of this issue.
Yes, this should have been resolved by my contribution.
Cheers

> KnnVectorQuery doesn't tie break by doc ID
> --
>
> Key: LUCENE-10397
> URL: https://issues.apache.org/jira/browse/LUCENE-10397
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> I was expecting KnnVectorQuery to tie-break by doc ID so that if multiple 
> documents get the same score, the ones with the lowest doc IDs would be 
> returned first, similarly to how SortField.SCORE also tie-breaks by doc 
> ID.
> However, the following test fails, suggesting that this is not the case.
> {code:java}
>   public void testTieBreak() throws IOException {
> try (Directory d = newDirectory()) {
>   try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) {
> for (int j = 0; j < 5; j++) {
>   Document doc = new Document();
>   doc.add(
>   new KnnVectorField("field", new float[] {0, 1}, VectorSimilarityFunction.DOT_PRODUCT));
>   w.addDocument(doc);
> }
>   }
>   try (IndexReader reader = DirectoryReader.open(d)) {
> assertEquals(1, reader.leaves().size());
> IndexSearcher searcher = new IndexSearcher(reader);
> KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, 3}, 3);
> TopDocs topHits = searcher.search(query, 3);
> assertEquals(0, topHits.scoreDocs[0].doc);
> assertEquals(1, topHits.scoreDocs[1].doc);
> assertEquals(2, topHits.scoreDocs[2].doc);
>   }
> }
>   }
> {code}






[jira] [Resolved] (LUCENE-10593) VectorSimilarityFunction reverse removal

2022-06-29 Thread Alessandro Benedetti (Jira)
Alessandro Benedetti resolved an issue as Fixed.

Lucene - Core / LUCENE-10593
VectorSimilarityFunction reverse removal

Change By: Alessandro Benedetti
Resolution: Fixed
Status: Open → Resolved

This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)


[jira] [Updated] (LUCENE-10593) VectorSimilarityFunction reverse removal

2022-06-29 Thread Alessandro Benedetti (Jira)
Alessandro Benedetti updated an issue.

Lucene - Core / LUCENE-10593
VectorSimilarityFunction reverse removal

Change By: Alessandro Benedetti
Fix Version/s: 9.3


[jira] [Assigned] (LUCENE-10593) VectorSimilarityFunction reverse removal

2022-06-29 Thread Alessandro Benedetti (Jira)
Alessandro Benedetti assigned an issue to Alessandro Benedetti.

Lucene - Core / LUCENE-10593
VectorSimilarityFunction reverse removal

Change By: Alessandro Benedetti
Assignee: Alessandro Benedetti


[jira] [Commented] (LUCENE-10593) VectorSimilarityFunction reverse removal

2022-06-23 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558141#comment-17558141
 ] 

Alessandro Benedetti commented on LUCENE-10593:
---

I have added recent performance test results to the pull request.
There is no evidence of a slowdown, so this refactor looks good to go to me.
Functional tests are all green.

I plan to continue the discussion and merge next week.

> VectorSimilarityFunction reverse removal
> 
>
> Key: LUCENE-10593
> URL: https://issues.apache.org/jira/browse/LUCENE-10593
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alessandro Benedetti
>Priority: Major
>  Labels: vector-based-search
>
> org.apache.lucene.index.VectorSimilarityFunction#EUCLIDEAN behaves in the 
> opposite way to the other similarities:
> A higher similarity value means a higher distance. For this reason it has 
> been marked with "reversed", and a function is present to map from the 
> similarity to a score (where higher means closer, as in all the other 
> similarities).
> This counterintuitive behavior, with no apparent explanation that I could 
> find (please correct me if I am wrong), brings a lot of nasty side effects 
> for code readability, especially when combined with NeighborQueue, which 
> has a "reversed" flag itself.
> In addition, it complicates the usage of the pattern:
> Result Queue -> MIN HEAP
> Candidate Queue -> MAX HEAP
> in HNSW searchers.
> The proposal in my Pull Request aims to:
> 1) make the Euclidean similarity just return the score, in line with the 
> other similarities, using the formula currently used to move from distance 
> to score
> 2) simplify the code, removing the bound checker that is no longer necessary
> 3) refactor here and there to stay in line with the simplification
> 4) refactor NeighborQueue to clearly state when it is a MIN_HEAP or 
> MAX_HEAP; debugging is now much easier and understanding the HNSW code is 
> much more intuitive
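The "formula currently used to move from distance to score" is not quoted in this thread; a common monotonic mapping of this kind (assumed here purely for illustration, not taken from the patch) is score = 1 / (1 + squaredDistance), which turns "smaller distance" into "higher score":

```java
public class DistanceToScoreSketch {
    // Hypothetical mapping: higher score for closer vectors, bounded in (0, 1].
    static float score(float squaredDistance) {
        return 1f / (1f + squaredDistance);
    }

    public static void main(String[] args) {
        // Identical vectors (squared distance 0) get the maximum score of 1.0.
        System.out.println(score(0f));
        // Larger distances map to strictly smaller scores.
        System.out.println(score(1f) > score(4f));
    }
}
```

Because the mapping is strictly decreasing in distance, comparisons on scores order neighbors the same way as comparisons on (negated) distances, which is what lets the "reversed" flag be removed.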



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] (LUCENE-10593) VectorSimilarityFunction reverse removal

2022-06-23 Thread Alessandro Benedetti (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-10593 ]


Alessandro Benedetti deleted comment on LUCENE-10593:
---

was (Author: alessandro.benedetti):
Hi @msokolov, @mayya-sharipova and @jtibshirani, I have finally finished my 
performance tests.
Initially the results were worse on this branch, which I found suspicious, as I 
expected the removal of the BoundChecker and of the reverse mechanism to 
outweigh the additional division in the distance measure during graph building 
and searching.

After a deep investigation I found the culprit (you see it in the latest 
commit).


{code:java}
// before:
if (neighborSimilarity >= score) {
// after: this version improves the performance dramatically in both indexing/searching
if ((neighborSimilarity < score) == false) {
{code}


After that fix, the results are very encouraging.
There are strong speedups for both angular and euclidean distances, for both 
indexing and searching.
*If this is validated, we are getting a great cleanup of the code and also a 
nice performance boost.*
I'll ask my colleague @eliaporciani to repeat the tests on an Apple M1.

The following tests were executed in IntelliJ, running 
org.apache.lucene.util.hnsw.KnnGraphTester on a 2.4 GHz 8-Core Intel Core i9 
with 32 GB 2667 MHz DDR4.


{noformat}
INDEXING EUCLIDEAN

-beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5
 -metric euclidean

ORIGINAL
IW 0 [2022-06-22T14:00:12.647030Z; main]: 64335 msec to write vectors
IW 0 [2022-06-22T14:01:57.425108Z; main]: 65710 msec to write vectors
IW 0 [2022-06-22T14:03:18.052900Z; main]: 64817 msec to write vectors

THIS BRANCH
IW 0 [2022-06-22T14:04:50.683607Z; main]: 6597 msec to write vectors
IW 0 [2022-06-22T14:05:34.090801Z; main]: 6687 msec to write vectors
IW 0 [2022-06-22T14:06:00.268309Z; main]: 6564 msec to write vectors

INDEXING ANGULAR

-beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5
 -metric angular

ORIGINAL
IW 0 [2022-06-22T13:55:45.401310Z; main]: 32897 msec to write vectors
IW 0 [2022-06-22T13:56:39.737642Z; main]: 33255 msec to write vectors
IW 0 [2022-06-22T13:57:31.172709Z; main]: 32576 msec to write vectors

THIS BRANCH
IW 0 [2022-06-22T13:52:06.085790Z; main]: 25261 msec to write vectors
IW 0 [2022-06-22T13:52:51.022766Z; main]: 25775 msec to write vectors
IW 0 [2022-06-22T13:53:47.565833Z; main]: 24523 msec to write vectors

SEARCH EUCLIDEAN

-niter 500 -beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5
 -search 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5
 -metric euclidean

ORIGINAL
completed 500 searches in 1026 ms: 487 QPS CPU time=1025ms
completed 500 searches in 1030 ms: 485 QPS CPU time=1029ms
completed 500 searches in 1031 ms: 484 QPS CPU time=1030ms

THIS BRANCH
completed 500 searches in 46 ms: 10869 QPS CPU time=46ms
completed 500 searches in 46 ms: 10869 QPS CPU time=46ms
completed 500 searches in 47 ms: 10638 QPS CPU time=46ms

SEARCH ANGULAR

-niter 500 -beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5
 -search 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5
 -metric angular

ORIGINAL
completed 500 searches in 154 ms: 3246 QPS CPU time=153ms
completed 500 searches in 162 ms: 3086 QPS CPU time=162ms
completed 500 searches in 166 ms: 3012 QPS CPU time=166ms

THIS BRANCH
completed 500 searches in 62 ms: 8064 QPS CPU time=62ms
completed 500 searches in 65 ms: 7692 QPS CPU time=65ms
completed 500 searches in 63 ms: 7936 QPS CPU time=62ms
{noformat}



Please correct me if I did anything wrong; it's the first time I have used 
org.apache.lucene.util.hnsw.KnnGraphTester.

I am proceeding to run additional performance tests on different datasets.
Functional tests are all green.

> VectorSimilarityFunction reverse removal
> 
>
> Key: LUCENE-10593
> URL: https://issues.apache.org/jira/browse/LUCENE-10593
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alessandro Benedetti
>Priority: Major
>  Labels: vector-based-search
>
> org.apache.lucene.index.VectorSimilarityFunction#EUCLIDEAN behaves in the 
> opposite way to the other similarities:
> A higher similarity value means a higher distance. For this reason it has 
> been marked with "reversed", and a function is present to map from the 
> similarity to a score (where higher means closer, as in all the other 
> similarities).
> Having this counterintuitive behavior with no apparent explanation I could 
> 

[jira] [Comment Edited] (LUCENE-10593) VectorSimilarityFunction reverse removal

2022-06-23 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558093#comment-17558093
 ] 

Alessandro Benedetti edited comment on LUCENE-10593 at 6/23/22 1:34 PM:


Hi @msokolov, @mayya-sharipova and @jtibshirani, I have finally finished my 
performance tests.
Initially the results were worse on this branch, which I found suspicious, as I 
expected the removal of the BoundChecker and of the reverse mechanism to 
outweigh the additional division in the distance measure during graph building 
and searching.

After a deep investigation I found the culprit (you see it in the latest 
commit).


{code:java}
// before:
if (neighborSimilarity >= score) {
// after: this version improves the performance dramatically in both indexing/searching
if ((neighborSimilarity < score) == false) {
{code}
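As an aside, the two conditions above are not semantically identical in Java: they differ when either operand is NaN, because every ordered comparison involving NaN evaluates to false. Whether that is also the source of the performance gap is not established here; this sketch only demonstrates the semantic difference:

```java
public class NanComparisonSketch {
    public static void main(String[] args) {
        float score = Float.NaN;
        float neighborSimilarity = 1.0f;
        // Every ordered comparison involving NaN evaluates to false...
        System.out.println(neighborSimilarity >= score);           // false
        // ...so negating the opposite comparison flips the result for NaN.
        System.out.println((neighborSimilarity < score) == false); // true
    }
}
```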


After that fix, the results are very encouraging.
There are strong speedups for both angular and euclidean distances, for both 
indexing and searching.
*If this is validated, we are getting a great cleanup of the code and also a 
nice performance boost.*
I'll ask my colleague @eliaporciani to repeat the tests on an Apple M1.

The following tests were executed in IntelliJ, running 
org.apache.lucene.util.hnsw.KnnGraphTester on a 2.4 GHz 8-Core Intel Core i9 
with 32 GB 2667 MHz DDR4.


{noformat}
INDEXING EUCLIDEAN

-beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5
 -metric euclidean

ORIGINAL
IW 0 [2022-06-22T14:00:12.647030Z; main]: 64335 msec to write vectors
IW 0 [2022-06-22T14:01:57.425108Z; main]: 65710 msec to write vectors
IW 0 [2022-06-22T14:03:18.052900Z; main]: 64817 msec to write vectors

THIS BRANCH
IW 0 [2022-06-22T14:04:50.683607Z; main]: 6597 msec to write vectors
IW 0 [2022-06-22T14:05:34.090801Z; main]: 6687 msec to write vectors
IW 0 [2022-06-22T14:06:00.268309Z; main]: 6564 msec to write vectors

INDEXING ANGULAR

-beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5
 -metric angular

ORIGINAL
IW 0 [2022-06-22T13:55:45.401310Z; main]: 32897 msec to write vectors
IW 0 [2022-06-22T13:56:39.737642Z; main]: 33255 msec to write vectors
IW 0 [2022-06-22T13:57:31.172709Z; main]: 32576 msec to write vectors

THIS BRANCH
IW 0 [2022-06-22T13:52:06.085790Z; main]: 25261 msec to write vectors
IW 0 [2022-06-22T13:52:51.022766Z; main]: 25775 msec to write vectors
IW 0 [2022-06-22T13:53:47.565833Z; main]: 24523 msec to write vectors

SEARCH EUCLIDEAN

-niter 500 -beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5
 -search 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5
 -metric euclidean

ORIGINAL
completed 500 searches in 1026 ms: 487 QPS CPU time=1025ms
completed 500 searches in 1030 ms: 485 QPS CPU time=1029ms
completed 500 searches in 1031 ms: 484 QPS CPU time=1030ms

THIS BRANCH
completed 500 searches in 46 ms: 10869 QPS CPU time=46ms
completed 500 searches in 46 ms: 10869 QPS CPU time=46ms
completed 500 searches in 47 ms: 10638 QPS CPU time=46ms

SEARCH ANGULAR

-niter 500 -beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5
 -search 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5
 -metric angular

ORIGINAL
completed 500 searches in 154 ms: 3246 QPS CPU time=153ms
completed 500 searches in 162 ms: 3086 QPS CPU time=162ms
completed 500 searches in 166 ms: 3012 QPS CPU time=166ms

THIS BRANCH
completed 500 searches in 62 ms: 8064 QPS CPU time=62ms
completed 500 searches in 65 ms: 7692 QPS CPU time=65ms
completed 500 searches in 63 ms: 7936 QPS CPU time=62ms
{noformat}



Please correct me if I did anything wrong; it's the first time I have used 
org.apache.lucene.util.hnsw.KnnGraphTester.

I am proceeding to run additional performance tests on different datasets.
Functional tests are all green.


was (Author: alessandro.benedetti):
Hi @msokolov, @mayya-sharipova and @jtibshirani, I have finally finished my 
performance tests.
Initially the results were worse on this branch, which I found suspicious, as I 
expected the removal of the BoundChecker and of the reverse mechanism to 
outweigh the additional division in the distance measure during graph building 
and searching.

After a deep investigation I found the culprit (you see it in the latest 
commit).


{code:java}
// before:
if (neighborSimilarity >= score) {
// after: this version improves the performance dramatically in both indexing/searching
if ((neighborSimilarity < score) == false) {
{code}


After that fix, the results are very encouraging.
There are strong speedups for both 

[jira] [Commented] (LUCENE-10593) VectorSimilarityFunction reverse removal

2022-06-23 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558093#comment-17558093
 ] 

Alessandro Benedetti commented on LUCENE-10593:
---

Hi @msokolov, @mayya-sharipova and @jtibshirani, I have finally finished my 
performance tests.
Initially the results were worse on this branch, which I found suspicious, as I 
expected the removal of the BoundChecker and of the reverse mechanism to 
outweigh the additional division in the distance measure during graph building 
and searching.

After a deep investigation I found the culprit (you see it in the latest 
commit).


{code:java}
// before:
if (neighborSimilarity >= score) {
// after: this version improves the performance dramatically in both indexing/searching
if ((neighborSimilarity < score) == false) {
{code}


After that fix, the results are very encouraging.
There are strong speedups for both angular and euclidean distances, for both 
indexing and searching.
*If this is validated, we are getting a great cleanup of the code and also a 
nice performance boost.*
I'll ask my colleague @eliaporciani to repeat the tests on an Apple M1.

The following tests were executed in IntelliJ, running 
org.apache.lucene.util.hnsw.KnnGraphTester on a 2.4 GHz 8-Core Intel Core i9 
with 32 GB 2667 MHz DDR4.


{noformat}
INDEXING EUCLIDEAN

-beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5
 -metric euclidean

ORIGINAL
IW 0 [2022-06-22T14:00:12.647030Z; main]: 64335 msec to write vectors
IW 0 [2022-06-22T14:01:57.425108Z; main]: 65710 msec to write vectors
IW 0 [2022-06-22T14:03:18.052900Z; main]: 64817 msec to write vectors

THIS BRANCH
IW 0 [2022-06-22T14:04:50.683607Z; main]: 6597 msec to write vectors
IW 0 [2022-06-22T14:05:34.090801Z; main]: 6687 msec to write vectors
IW 0 [2022-06-22T14:06:00.268309Z; main]: 6564 msec to write vectors

INDEXING ANGULAR

-beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5
 -metric angular

ORIGINAL
IW 0 [2022-06-22T13:55:45.401310Z; main]: 32897 msec to write vectors
IW 0 [2022-06-22T13:56:39.737642Z; main]: 33255 msec to write vectors
IW 0 [2022-06-22T13:57:31.172709Z; main]: 32576 msec to write vectors

THIS BRANCH
IW 0 [2022-06-22T13:52:06.085790Z; main]: 25261 msec to write vectors
IW 0 [2022-06-22T13:52:51.022766Z; main]: 25775 msec to write vectors
IW 0 [2022-06-22T13:53:47.565833Z; main]: 24523 msec to write vectors

SEARCH EUCLIDEAN

-niter 500 -beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5
 -search 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5
 -metric euclidean

ORIGINAL
completed 500 searches in 1026 ms: 487 QPS CPU time=1025ms
completed 500 searches in 1030 ms: 485 QPS CPU time=1029ms
completed 500 searches in 1031 ms: 484 QPS CPU time=1030ms

THIS BRANCH
completed 500 searches in 46 ms: 10869 QPS CPU time=46ms
completed 500 searches in 46 ms: 10869 QPS CPU time=46ms
completed 500 searches in 47 ms: 10638 QPS CPU time=46ms

SEARCH ANGULAR

-niter 500 -beamWidthIndex 100 -maxConn 16 -ndoc 8 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5
 -search 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5
 -metric angular

ORIGINAL
completed 500 searches in 154 ms: 3246 QPS CPU time=153ms
completed 500 searches in 162 ms: 3086 QPS CPU time=162ms
completed 500 searches in 166 ms: 3012 QPS CPU time=166ms

THIS BRANCH
completed 500 searches in 62 ms: 8064 QPS CPU time=62ms
completed 500 searches in 65 ms: 7692 QPS CPU time=65ms
completed 500 searches in 63 ms: 7936 QPS CPU time=62ms
{noformat}



Please correct me if I did anything wrong; it's the first time I have used 
org.apache.lucene.util.hnsw.KnnGraphTester.

> VectorSimilarityFunction reverse removal
> 
>
> Key: LUCENE-10593
> URL: https://issues.apache.org/jira/browse/LUCENE-10593
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alessandro Benedetti
>Priority: Major
>  Labels: vector-based-search
>
> org.apache.lucene.index.VectorSimilarityFunction#EUCLIDEAN behaves in the 
> opposite way to the other similarities:
> A higher similarity value means a higher distance. For this reason it has 
> been marked with "reversed", and a function is present to map from the 
> similarity to a score (where higher means closer, as in all the other 
> similarities).
> Having this counterintuitive behavior with no apparent explanation I could 
> find (please correct me if I am wrong) brings a lot of nasty side 

[jira] [Commented] (LUCENE-10593) VectorSimilarityFunction reverse removal

2022-05-26 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17542677#comment-17542677
 ] 

Alessandro Benedetti commented on LUCENE-10593:
---

https://github.com/apache/lucene/pull/926 has been opened, [~sokolov], 
[~mayya], [~julietibs] [~jpountz] feel free to review

> VectorSimilarityFunction reverse removal
> 
>
> Key: LUCENE-10593
> URL: https://issues.apache.org/jira/browse/LUCENE-10593
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alessandro Benedetti
>Priority: Major
>  Labels: vector-based-search
>
> org.apache.lucene.index.VectorSimilarityFunction#EUCLIDEAN behaves in the 
> opposite way to the other similarities:
> A higher similarity value means a higher distance. For this reason it has 
> been marked with "reversed", and a function is present to map from the 
> similarity to a score (where higher means closer, as in all the other 
> similarities).
> This counterintuitive behavior, with no apparent explanation that I could 
> find (please correct me if I am wrong), brings a lot of nasty side effects 
> for code readability, especially when combined with NeighborQueue, which 
> has a "reversed" flag itself.
> In addition, it complicates the usage of the pattern:
> Result Queue -> MIN HEAP
> Candidate Queue -> MAX HEAP
> in HNSW searchers.
> The proposal in my Pull Request aims to:
> 1) make the Euclidean similarity just return the score, in line with the 
> other similarities, using the formula currently used to move from distance 
> to score
> 2) simplify the code, removing the bound checker that is no longer necessary
> 3) refactor here and there to stay in line with the simplification
> 4) refactor NeighborQueue to clearly state when it is a MIN_HEAP or 
> MAX_HEAP; debugging is now much easier and understanding the HNSW code is 
> much more intuitive






[jira] [Commented] (LUCENE-10510) Check module access prior to running gjf/spotless/errorprone tasks

2022-05-26 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17542669#comment-17542669
 ] 

Alessandro Benedetti commented on LUCENE-10510:
---

[~dweiss] Your help has been pure gold, thank you very much!

I had to delete gradle.properties and run ./gradlew tidy twice.
The first time I got the error again; the second time it went ok.

Should we document that more clearly?
Do you know why this happens?
The note "occasionally you may have to manually delete (or move) this 
file and regenerate from scratch" didn't catch my attention.

> Check module access prior to running gjf/spotless/errorprone tasks
> --
>
> Key: LUCENE-10510
> URL: https://issues.apache.org/jira/browse/LUCENE-10510
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
> Fix For: 9.2
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> PR at: [https://github.com/apache/lucene/pull/802]






[jira] [Created] (LUCENE-10593) VectorSimilarityFunction reverse removal

2022-05-26 Thread Alessandro Benedetti (Jira)
Alessandro Benedetti created LUCENE-10593:
-

 Summary: VectorSimilarityFunction reverse removal
 Key: LUCENE-10593
 URL: https://issues.apache.org/jira/browse/LUCENE-10593
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Alessandro Benedetti


org.apache.lucene.index.VectorSimilarityFunction#EUCLIDEAN behaves in the 
opposite way to the other similarities:
A higher similarity value means a higher distance. For this reason it has been 
marked with "reversed", and a function is present to map from the similarity to 
a score (where higher means closer, as in all the other similarities).

This counterintuitive behavior, with no apparent explanation that I could find 
(please correct me if I am wrong), brings a lot of nasty side effects for code 
readability, especially when combined with NeighborQueue, which has a 
"reversed" flag itself.
In addition, it complicates the usage of the pattern:
Result Queue -> MIN HEAP
Candidate Queue -> MAX HEAP
in HNSW searchers.

The proposal in my Pull Request aims to:

1) make the Euclidean similarity just return the score, in line with the other 
similarities, using the formula currently used to move from distance to score

2) simplify the code, removing the bound checker that is no longer necessary

3) refactor here and there to stay in line with the simplification

4) refactor NeighborQueue to clearly state when it is a MIN_HEAP or MAX_HEAP; 
debugging is now much easier and understanding the HNSW code is much more 
intuitive
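The Result Queue -> MIN HEAP / Candidate Queue -> MAX HEAP pattern described above can be sketched with plain PriorityQueues; this is an illustration of the pattern, not the NeighborQueue API:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class HeapPatternSketch {
    public static void main(String[] args) {
        int k = 2;
        // Result queue: MIN-heap on score, so peek() is the worst hit kept so far.
        PriorityQueue<Float> results = new PriorityQueue<>();
        // Candidate queue: MAX-heap on score, so poll() expands the best candidate first.
        PriorityQueue<Float> candidates = new PriorityQueue<>(Comparator.reverseOrder());

        for (float score : new float[] {0.3f, 0.9f, 0.1f, 0.7f}) {
            candidates.add(score);
        }
        while (!candidates.isEmpty()) {
            float best = candidates.poll();
            if (results.size() < k) {
                results.add(best);
            } else if (best > results.peek()) {
                results.poll();       // evict the current worst result
                results.add(best);
            }
        }
        System.out.println(results.peek()); // worst of the top-k scores
    }
}
```

Once all similarities agree that higher means closer, both heaps can compare raw scores directly, with no per-similarity "reversed" special case.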







[jira] [Commented] (LUCENE-10510) Check module access prior to running gjf/spotless/errorprone tasks

2022-05-26 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17542666#comment-17542666
 ] 

Alessandro Benedetti commented on LUCENE-10510:
---

I spent roughly one hour fighting with Gradle; I was trying to run ./gradlew 
tidy before ./gradlew check.
I have JDK 17, and all I get is always a vague:
"> Certain gradle tasks and plugins require access to jdk.compiler internals, 
your gradle.properties might have just been generated or could be out of sync 
(see help/localSettings.txt)"

I explored the code that generates the exception:

{code:java}
task checkJdkInternalsExportedToGradle() {
  doFirst {
    def jdkCompilerModule = ModuleLayer.boot().findModule("jdk.compiler").orElseThrow()
    def gradleModule = getClass().module
    def internalsExported = [
        "com.sun.tools.javac.api",
        "com.sun.tools.javac.file",
        "com.sun.tools.javac.parser",
        "com.sun.tools.javac.tree",
        "com.sun.tools.javac.util"
    ].stream()
        .allMatch(pkg -> jdkCompilerModule.isExported(pkg, gradleModule))

    if (!internalsExported) {
      throw new GradleException(
          "Certain gradle tasks and plugins require access to jdk.compiler" +
          " internals, your gradle.properties might have just been generated or could be" +
          " out of sync (see help/localSettings.txt)")
    }
  }
}
{code}

I also read help/localSettings.txt, with no success.
Maybe I am tired tonight; am I missing something?
I couldn't find any recommendation for how to fix the problem.
If I am not missing anything, we should do something, as I assume a random new 
contributor would be lost.


> Check module access prior to running gjf/spotless/errorprone tasks
> --
>
> Key: LUCENE-10510
> URL: https://issues.apache.org/jira/browse/LUCENE-10510
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
> Fix For: 9.2
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> PR at: [https://github.com/apache/lucene/pull/802]






[jira] [Updated] (LUCENE-10054) Handle hierarchy in HNSW graph

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10054:
--
Labels: vector-based-search  (was: )

> Handle hierarchy in HNSW graph
> --
>
> Key: LUCENE-10054
> URL: https://issues.apache.org/jira/browse/LUCENE-10054
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Mayya Sharipova
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.1
>
>  Time Spent: 20h 20m
>  Remaining Estimate: 0h
>
> Currently HNSW graph is represented as a single layer graph. 
>  We would like to extend it to handle hierarchy as per 
> [discussion|https://issues.apache.org/jira/browse/LUCENE-9004?focusedCommentId=17393216=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17393216].
>  
>  
> TODO tasks:
> - add multiple layers in the HnswGraph class
>  - modify the format in  Lucene90HnswVectorsWriter and 
> Lucene90HnswVectorsReader to handle multiple layers
> - modify graph construction and search algorithm to handle hierarchy
>  - run benchmarks
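The multi-layer structure hinges on how each node's top layer is chosen. Below is a minimal sketch of the level-assignment rule from the HNSW paper (level = floor(-ln(u) * mL)); this is not Lucene's actual code, and the names (LevelAssigner, assignLevel, mL) are illustrative:

```java
import java.util.SplittableRandom;

// Hedged sketch of hierarchical level assignment from the HNSW paper:
// most nodes sit only on layer 0, and each higher layer holds
// exponentially fewer nodes. Illustrative names, not Lucene's API.
public class LevelAssigner {
  private final double mL;              // normalization factor, often 1 / ln(M)
  private final SplittableRandom random;

  public LevelAssigner(int m, long seed) {
    this.mL = 1.0 / Math.log(m);
    this.random = new SplittableRandom(seed);
  }

  /** Draws the top layer for a newly inserted node. */
  public int assignLevel() {
    double u = 1.0 - random.nextDouble(); // in (0, 1], so the log is finite
    return (int) Math.floor(-Math.log(u) * mL);
  }
}
```

With m = 16, about 1/16 of nodes reach layer 1, 1/256 reach layer 2, and so on, which is what makes the upper layers cheap to store and search.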






[jira] [Updated] (LUCENE-10183) KnnVectorsWriter#writeField should take a KnnVectorsReader, not a VectorValues instance

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10183:
--
Labels: vector-based-search  (was: )

> KnnVectorsWriter#writeField should take a KnnVectorsReader, not a 
> VectorValues instance
> ---
>
> Key: LUCENE-10183
> URL: https://issues.apache.org/jira/browse/LUCENE-10183
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>  Labels: vector-based-search
> Fix For: 9.1
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> By taking a VectorValues instance, KnnVectorsWriter#write doesn't let 
> implementations iterate over vectors multiple times if needed. It should take 
> a KnnVectorsReader, similarly to doc values, where the writer takes a 
> DocValuesProducer.






[jira] [Updated] (LUCENE-10309) Minimum KnnVector codec support in Luke

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10309:
--
Labels: vector-based-search  (was: )

> Minimum KnnVector codec support in Luke
> ---
>
> Key: LUCENE-10309
> URL: https://issues.apache.org/jira/browse/LUCENE-10309
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: luke
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Minor
>  Labels: vector-based-search
> Fix For: 9.1, 10.0 (main)
>
> Attachments: Screenshot from 2021-12-12 14-40-41.png, Screenshot from 
> 2021-12-12 14-54-47.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> (For completeness,) Luke should show KnnVector format information in the 
> index browsing tab.
> If the type of a field is a KnnVector,
>  * Show flag "K"
>  * Show its dimension
>  * Show its similarity function
> Richer support for the codec - decoding or searching - could come later; I 
> don't know if there are such use cases.






[jira] [Updated] (LUCENE-10351) Correct knn search failure with all deleted docs

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10351:
--
Labels: vector-based-search  (was: )

> Correct knn search failure with  all deleted docs
> -
>
> Key: LUCENE-10351
> URL: https://issues.apache.org/jira/browse/LUCENE-10351
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Mayya Sharipova
>Priority: Trivial
>  Labels: vector-based-search
> Fix For: 9.1
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Currently, when doing a knn search on a segment where all documents with the 
> knn field were deleted, we get the following error:
> maxSize must be > 0 and < 2147483630; got: 0
> java.lang.IllegalArgumentException: maxSize must be > 0 and < 2147483630; 
> got: 0
> at 
> __randomizedtesting.SeedInfo.seed([43F1F124D7076A4E:1B860BFCCB9B0BB5]:0)
> at org.apache.lucene.util.LongHeap.(LongHeap.java:57)
> at org.apache.lucene.util.LongHeap$1.(LongHeap.java:69)
> at org.apache.lucene.util.LongHeap.create(LongHeap.java:69)
> at 
> org.apache.lucene.util.hnsw.NeighborQueue.(NeighborQueue.java:41)
> at 
> org.apache.lucene.util.hnsw.HnswGraph.search(HnswGraph.java:105)#
> The desired behaviour: instead of an error, an empty TopDocs should be 
> returned. 
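The fix amounts to a guard clause: detect the empty case before constructing a zero-size heap (the IllegalArgumentException above comes from building a LongHeap with maxSize 0). A standalone sketch using plain Java collections, not Lucene's LongHeap or TopDocs; all names are illustrative:

```java
import java.util.PriorityQueue;

// Hedged sketch of the desired behaviour: when there are no candidates,
// return an empty result instead of building a zero-capacity heap.
public class SimpleKnn {
  public static int[] topK(float[] scores, int k) {
    int size = Math.min(k, scores.length);
    if (size == 0) {
      return new int[0]; // empty "TopDocs" instead of an exception
    }
    // min-heap by score keeps the best `size` doc ids
    PriorityQueue<Integer> heap =
        new PriorityQueue<>(size, (a, b) -> Float.compare(scores[a], scores[b]));
    for (int doc = 0; doc < scores.length; doc++) {
      heap.offer(doc);
      if (heap.size() > size) {
        heap.poll(); // evict the current worst hit
      }
    }
    int[] result = new int[heap.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = heap.poll(); // fill from the back so result is best-first
    }
    return result;
  }
}
```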






[jira] [Updated] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10382:
--
Labels: vector-based-search  (was: )

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.1
>
>  Time Spent: 7h 50m
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.
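The interface change boils down to threading an accept-docs check into the search. A brute-force stand-in to show the shape, with java.util.function.IntPredicate in place of Lucene's Bits; names are illustrative:

```java
import java.util.function.IntPredicate;

// Hedged sketch: restricting a nearest-vector scan to a subset of docs,
// with IntPredicate standing in for Lucene's Bits liveDocs / filter bits.
public class FilteredNearest {
  public static int nearest(float[][] vectors, float[] query, IntPredicate acceptDocs) {
    int best = -1;
    float bestScore = Float.NEGATIVE_INFINITY;
    for (int doc = 0; doc < vectors.length; doc++) {
      if (!acceptDocs.test(doc)) {
        continue; // skip docs outside the accepted subset
      }
      float dot = 0;
      for (int i = 0; i < query.length; i++) {
        dot += vectors[doc][i] * query[i];
      }
      if (dot > bestScore) {
        bestScore = dot;
        best = doc;
      }
    }
    return best; // -1 if no doc was accepted
  }
}
```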






[jira] [Updated] (LUCENE-10375) Speed up HNSW merge by writing combined vector data

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10375:
--
Labels: vector-based-search  (was: )

> Speed up HNSW merge by writing combined vector data
> ---
>
> Key: LUCENE-10375
> URL: https://issues.apache.org/jira/browse/LUCENE-10375
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.1
>
>  Time Spent: 7h 10m
>  Remaining Estimate: 0h
>
> When merging segments together, the HNSW writer creates a VectorValues 
> instance that gives a merged view of all the segments' VectorValues. This 
> merged instance is used when constructing the new HNSW graph. Graph building 
> needs random access, and the merged VectorValues support this by mapping from 
> merged ordinals -> segments and segment ordinals.
> This mapping seems to add overhead. The nightly indexing benchmarks sometimes 
> show substantial time in Arrays.binarySearch (used to map an ordinal to a 
> segment): 
> https://blunders.io/jfr-demo/indexing-1kb-vectors-2022.01.09.18.03.19/top_down_cpu_samples.
> Instead of using a merged VectorValues to create the graph, maybe we could 
> first write all the segment vectors to a file, and use that file to build the 
> graph.
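The binarySearch hotspot comes from resolving each merged ordinal back to its segment. A minimal reconstruction of that mapping (hypothetical class, not the actual Lucene merge code): each segment owns a contiguous range of merged ordinals, and a binary search over the per-segment base offsets recovers the pair (segment, segment-local ordinal).

```java
import java.util.Arrays;

// Hedged sketch of the merged-ordinal -> (segment, segment ordinal) mapping
// that shows up in the profiles. Illustrative names only.
public class OrdinalMap {
  private final int[] segmentBases; // first merged ordinal of each segment, ascending

  public OrdinalMap(int[] segmentSizes) {
    segmentBases = new int[segmentSizes.length];
    int base = 0;
    for (int i = 0; i < segmentSizes.length; i++) {
      segmentBases[i] = base;
      base += segmentSizes[i];
    }
  }

  /** Returns {segment index, ordinal within that segment}. */
  public int[] resolve(int mergedOrd) {
    int idx = Arrays.binarySearch(segmentBases, mergedOrd);
    if (idx < 0) {
      idx = -idx - 2; // miss: insertion point - 1 is the containing segment
    }
    return new int[] {idx, mergedOrd - segmentBases[idx]};
  }
}
```

On a miss, Arrays.binarySearch returns -(insertionPoint) - 1, so `-idx - 2` recovers insertionPoint - 1: the segment whose base is just below the ordinal. Doing this per graph-construction access is the overhead the issue proposes to avoid by writing the combined vector data up front.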






[jira] [Updated] (LUCENE-10391) Reuse data structures across HnswGraph invocations

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10391:
--
Labels: vector-based-search  (was: )

> Reuse data structures across HnswGraph invocations
> --
>
> Key: LUCENE-10391
> URL: https://issues.apache.org/jira/browse/LUCENE-10391
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Julie Tibshirani
>Priority: Minor
>  Labels: vector-based-search
> Fix For: 9.1
>
> Attachments: Screen Shot 2022-02-24 at 10.18.42 AM.png
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Creating HNSW graphs involves doing many repeated calls to HnswGraph#search. 
> Profiles from nightly benchmarks suggest that allocating data-structures 
> incurs both lots of heap allocations 
> ([http://people.apache.org/~mikemccand/lucenebench/2022.01.23.18.03.17.html#profiler_1kb_indexing_vectors_4_heap)]
>  and CPU usage 
> ([http://people.apache.org/~mikemccand/lucenebench/2022.01.23.18.03.17.html#profiler_1kb_indexing_vectors_4_cpu).]
>  It looks like reusing data structures across invocations would be a 
> low-hanging fruit that could help save significant CPU?






[jira] [Updated] (LUCENE-10408) Better dense encoding of doc Ids in Lucene91HnswVectorsFormat

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10408:
--
Labels: vector-based-search  (was: )

> Better dense encoding of doc Ids in Lucene91HnswVectorsFormat
> -
>
> Key: LUCENE-10408
> URL: https://issues.apache.org/jira/browse/LUCENE-10408
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Assignee: Mayya Sharipova
>Priority: Minor
>  Labels: vector-based-search
> Fix For: 9.1
>
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Currently we write the doc IDs of all documents that have vectors as-is. We 
> should improve their encoding using either delta encoding or a bitset.
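The delta-encoding option is straightforward for a sorted doc-id list: store the gap from the previous id instead of the absolute id, so values stay small and compress well with variable-length ints. A sketch with illustrative names, not the actual format code:

```java
// Hedged sketch of delta encoding for a sorted doc-id list.
public class DeltaDocIds {
  public static int[] encode(int[] sortedDocIds) {
    int[] deltas = new int[sortedDocIds.length];
    int prev = 0;
    for (int i = 0; i < sortedDocIds.length; i++) {
      deltas[i] = sortedDocIds[i] - prev; // small gap instead of absolute id
      prev = sortedDocIds[i];
    }
    return deltas;
  }

  public static int[] decode(int[] deltas) {
    int[] docIds = new int[deltas.length];
    int prev = 0;
    for (int i = 0; i < deltas.length; i++) {
      prev += deltas[i]; // running sum restores the absolute id
      docIds[i] = prev;
    }
    return docIds;
  }
}
```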






[jira] [Updated] (LUCENE-10421) Non-deterministic results from KnnVectorQuery?

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10421:
--
Labels: vector-based-search  (was: )

> Non-deterministic results from KnnVectorQuery?
> --
>
> Key: LUCENE-10421
> URL: https://issues.apache.org/jira/browse/LUCENE-10421
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.1
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> [Nightly benchmarks|https://home.apache.org/~mikemccand/lucenebench/] have 
> been upset for the past ~1.5 weeks because it looks like {{KnnVectorQuery}} 
> is giving slightly different results on every run, even on an identical 
> (deterministically constructed – single thread indexing, flush by doc count, 
> {{{}SerialMergeSchedule{}}}, {{{}LogDocCountMergePolicy{}}}, etc.) index each 
> night.  It produces failures like this, which then abort the benchmark to 
> help us catch any recent accidental bug that alters our precise top N search 
> hits and scores:
> {noformat}
>  Traceback (most recent call last):
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 2177, in 
>   run()
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 1225, in run
>   raise RuntimeError(‘search result differences: %s’ % str(errors))
> RuntimeError: search result differences: 
> [“query=KnnVectorQuery:vector[-0.07267512,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 4 has wrong field/score value ([20844660], 
> ‘0.92060816’) vs ([254438\
> 06], ‘0.920046’)“, “query=KnnVectorQuery:vector[-0.12073054,...][10] 
> filter=None sort=None groupField=None hitCount=10: hit 7 has wrong 
> field/score value ([25501982], ‘0.99630797’) vs ([13688085], ‘0.9961489’)“, 
> “qu\
> ery=KnnVectorQuery:vector[0.02227773,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 0 has wrong field/score value ([4741915], 
> ‘0.9481132’) vs ([14220828], ‘0.9579846’)“, “query=KnnVectorQuery:vector\
> [0.024077624,...][10] filter=None sort=None groupField=None hitCount=10: hit 
> 0 has wrong field/score value ([7472373], ‘0.8460249’) vs ([12577825], 
> ‘0.8378446’)“]{noformat}
> At first I thought this might be expected because of the recent (awesome!!) 
> improvements to HNSW, so I tried to simply "regold".  But the regold did not 
> "take", so it indeed looks like there is some non-determinism here.
> I pinged [~msoko...@gmail.com] and he found this random seeding that is most 
> likely the cause?
> {noformat}
> public final class HnswGraphBuilder {
>   /** Default random seed for level generation * */
>   private static final long DEFAULT_RAND_SEED = System.currentTimeMillis(); 
> {noformat}
> Can we somehow make this deterministic instead?  Or maybe the nightly 
> benchmarks could somehow pass something in to make results deterministic for 
> benchmarking?  Or ... we could also relax the benchmarks to accept 
> non-determinism for {{KnnVectorQuery}} task?
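Conceptually the fix is simple: any fixed seed makes the random level sequence, and hence the graph, reproducible run-to-run, while System.currentTimeMillis() as a seed cannot be. A small illustration with hypothetical names (not HnswGraphBuilder's actual code):

```java
import java.util.SplittableRandom;

// Hedged sketch: a fixed seed makes the level sequence driving graph
// construction deterministic, which is what reproducible benchmarking needs.
public class DeterministicLevels {
  public static int[] levels(long seed, int count, double mL) {
    SplittableRandom random = new SplittableRandom(seed);
    int[] levels = new int[count];
    for (int i = 0; i < count; i++) {
      // HNSW-style level draw; 1 - nextDouble() keeps the log argument in (0, 1]
      levels[i] = (int) Math.floor(-Math.log(1.0 - random.nextDouble()) * mL);
    }
    return levels;
  }
}
```

Two runs with the same seed produce identical sequences, so graph construction, and therefore the top-N hits, would stop drifting between nightly runs.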






[jira] [Updated] (LUCENE-10453) Speed up VectorUtil#squareDistance

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10453:
--
Labels: vector-based-search  (was: )

> Speed up VectorUtil#squareDistance
> --
>
> Key: LUCENE-10453
> URL: https://issues.apache.org/jira/browse/LUCENE-10453
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Labels: vector-based-search
> Fix For: 9.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{VectorUtil#squareDistance}} is used in conjunction with 
> {{VectorSimilarityFunction#EUCLIDEAN}}.
> It didn't get as much love as dot products (LUCENE-9837) yet there seems to 
> be room for improvement. I wrote a quick JMH benchmark to run some 
> comparisons: https://github.com/jpountz/vector-similarity-benchmarks.
> While it's not as fast as using the vector API (which makes squareDistance 
> computations more than 2x faster), we can get a ~25% speedup by unrolling the 
> loop in a similar way to what dot product does.
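The unrolling in question keeps several independent accumulators per iteration so the floating-point adds don't form one long dependency chain. A sketch of the transformation (not the committed VectorUtil code; actual speedups depend on the JIT):

```java
// Hedged sketch of 4-way loop unrolling for squared Euclidean distance.
public class SquareDistance {
  public static float naive(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  public static float unrolled(float[] a, float[] b) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    int bound = a.length - (a.length % 4);
    for (; i < bound; i += 4) { // four independent accumulators per iteration
      float d0 = a[i] - b[i];
      float d1 = a[i + 1] - b[i + 1];
      float d2 = a[i + 2] - b[i + 2];
      float d3 = a[i + 3] - b[i + 3];
      s0 += d0 * d0;
      s1 += d1 * d1;
      s2 += d2 * d2;
      s3 += d3 * d3;
    }
    float sum = s0 + s1 + s2 + s3;
    for (; i < a.length; i++) { // scalar tail for the leftover elements
      float diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }
}
```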






[jira] [Updated] (LUCENE-9322) Discussing a unified vectors format API

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-9322:
-
Labels: vector-based-search  (was: )

> Discussing a unified vectors format API
> ---
>
> Key: LUCENE-9322
> URL: https://issues.apache.org/jira/browse/LUCENE-9322
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Julie Tibshirani
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 11h
>  Remaining Estimate: 0h
>
> Two different approximate nearest neighbor approaches are currently being 
> developed, one based on HNSW (LUCENE-9004) and another based on coarse 
> quantization ([#LUCENE-9136]). Each prototype proposes to add a new format to 
> handle vectors. In LUCENE-9136 we discussed the possibility of a unified API 
> that could support both approaches. The two ANN strategies give different 
> trade-offs in terms of speed, memory, and complexity, and it’s likely that 
> we’ll want to support both. Vector search is also an active research area, 
> and it would be great to be able to prototype and incorporate new approaches 
> without introducing more formats.
> To me it seems like a good time to begin discussing a unified API. The 
> prototype for coarse quantization 
> ([https://github.com/apache/lucene-solr/pull/1314]) could be ready to commit 
> soon (this depends on everyone's feedback of course). The approach is simple 
> and shows solid search performance, as seen 
> [here|https://github.com/apache/lucene-solr/pull/1314#issuecomment-608645326].
>  I think this API discussion is an important step in moving that 
> implementation forward.
> The goals of the API would be
> # Support for storing and retrieving individual float vectors.
> # Support for approximate nearest neighbor search -- given a query vector, 
> return the indexed vectors that are closest to it.






[jira] [Updated] (LUCENE-9837) try to improve performance of VectorUtil.dotProduct

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-9837:
-
Labels: vector-based-search  (was: )

> try to improve performance of VectorUtil.dotProduct
> ---
>
> Key: LUCENE-9837
> URL: https://issues.apache.org/jira/browse/LUCENE-9837
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> This is the king of cpu usage for the nightly benchmark. Let's see if we can 
> optimize it a bit.






[jira] [Updated] (LUCENE-9855) Reconsider names for ANN related format and APIs

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-9855:
-
Labels: vector-based-search  (was: )

> Reconsider names for ANN related format and APIs
> 
>
> Key: LUCENE-9855
> URL: https://issues.apache.org/jira/browse/LUCENE-9855
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 9.0
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Blocker
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> There is some discussion about the codec name for ann search.
> https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E
> Main points here are 1) use plural form for consistency, and 2) use more 
> specific name for ann search (second point could be optional).
> A few alternatives were proposed:
> - VectorsFormat
> - VectorValuesFormat
> - NeighborsFormat
> - DenseVectorsFormat






[jira] [Updated] (LUCENE-9905) Revise approach to specifying NN algorithm

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-9905:
-
Labels: vector-based-search  (was: )

> Revise approach to specifying NN algorithm
> --
>
> Key: LUCENE-9905
> URL: https://issues.apache.org/jira/browse/LUCENE-9905
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Julie Tibshirani
>Priority: Blocker
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> In LUCENE-9322 we decided that the new vectors API shouldn’t assume a 
> particular nearest-neighbor search data structure and algorithm. This 
> flexibility is important since NN search is a developing area and we'd like 
> to be able to experiment and evolve the algorithm. Right now we only have one 
> algorithm (HNSW), but we want to maintain the ability to use another.
> Currently the algorithm to use is specified through {{SearchStrategy}}, for 
> example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation 
> is expected to handle multiple algorithms. Instead we could have one format 
> implementation per algorithm. Our current implementation would be 
> HNSW-specific like {{HnswVectorFormat}}, and to experiment with another 
> algorithm you could create a new implementation like {{ClusterVectorFormat}}. 
> This would be better aligned with the codec framework, and help avoid 
> exposing algorithm details in the API.
> A concrete proposal (note many of these names will change when LUCENE-9855 is 
> addressed):
> # Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add 
> HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
> # Remove references to HNSW in {{SearchStrategy}}, so there is just 
> {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something 
> like {{SimilarityFunction}}.
> # Remove {{FieldType}} attributes related to HNSW parameters (maxConn and 
> beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
> # Introduce {{PerFieldVectorFormat}} to allow a different NN approach or 
> parameters to be configured per-field \(?\)
> One note: the current HNSW-based format includes logic for storing a numeric 
> vector per document, as well as constructing + storing a HNSW graph. When 
> adding another implementation, it’d be nice to be able to reuse logic for 
> reading/ writing numeric vectors. I don’t think we need to design for this 
> right now, but we can keep it in mind for the future?
> This issue is based on a thread [~jpountz] started: 
> [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]






[jira] [Updated] (LUCENE-9908) Move VectorValues#search to VectorReader and LeafReader

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-9908:
-
Labels: vector-based-search  (was: )

> Move VectorValues#search to VectorReader and LeafReader
> ---
>
> Key: LUCENE-9908
> URL: https://issues.apache.org/jira/browse/LUCENE-9908
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Adrien Grand
>Assignee: Julie Tibshirani
>Priority: Blocker
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> As ANN search doesn't require a positioned iterator, we should move it from 
> {{VectorValues}} to {{VectorReader}} and make it available from 
> {{LeafReader}} via a new API, something like 
> {{LeafReader#searchNearestNeighbors}}?






[jira] [Updated] (LUCENE-10016) VectorReader.search needs rethought, o.a.l.search integration?

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10016:
--
Labels: vector-based-search  (was: )

> VectorReader.search needs rethought, o.a.l.search integration?
> --
>
> Key: LUCENE-10016
> URL: https://issues.apache.org/jira/browse/LUCENE-10016
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Blocker
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> There's no search integration (e.g. queries) for the current vector values, 
> no documentation/examples that I can find.
> Instead the codec has this method:
> {code}
> TopDocs search(String field, float[] target, int k, int fanout)
> {code}
> First, the "fanout" parameter needs to go, this is specific to HNSW impl, get 
> it out of here.
> Second, how am I supposed to skip over deleted documents? How can I use 
> filters? How should I search across multiple segments?






[jira] [Updated] (LUCENE-10015) Remove VectorValues.SimilarityFunction.NONE

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10015:
--
Labels: vector-based-search  (was: )

> Remove VectorValues.SimilarityFunction.NONE
> ---
>
> Key: LUCENE-10015
> URL: https://issues.apache.org/jira/browse/LUCENE-10015
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Blocker
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> This stuff is HNSW-implementation specific. It can be moved to a codec 
> parameter.
> The NONE option should be removed: it just makes the codec more complex.






[jira] [Updated] (LUCENE-10040) Handle deletions in nearest vector search

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10040:
--
Labels: vector-based-search  (was: )

> Handle deletions in nearest vector search
> -
>
> Key: LUCENE-10040
> URL: https://issues.apache.org/jira/browse/LUCENE-10040
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Assignee: Julie Tibshirani
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> Currently nearest vector search doesn't account for deleted documents. Even 
> if a document is not in {{LeafReader#getLiveDocs}}, it could still be 
> returned from {{LeafReader#searchNearestVectors}}. This seems like it'd be 
> surprising + difficult for users, since other search APIs account for deleted 
> docs. We've discussed extending the search logic to take a parameter like 
> {{Bits liveDocs}}. This issue discusses options around adding support.
> One approach is to just filter out deleted docs after running the KNN search. 
> This behavior seems hard to work with as a user: fewer than {{k}} docs might 
> come back from your KNN search!
> Alternatively, {{LeafReader#searchNearestVectors}} could always return the 
> {{k}} nearest undeleted docs. To implement this, HNSW could omit deleted docs 
> while assembling its candidate list. It would traverse further into the 
> graph, visiting more nodes to ensure it gathers the required candidates. 
> (Note deleted docs would still be visited/ traversed). The [hnswlib 
> library|https://github.com/nmslib/hnswlib] contains an implementation like 
> this, where you can mark documents as deleted and they're skipped during 
> search.
> This approach seems reasonable to me, but there are some challenges:
>  * Performance can be unpredictable. If deletions are random, it shouldn't 
> have a huge effect. But in the worst case, a segment could have 50% deleted 
> docs, and they all happen to be near the query vector. HNSW would need to 
> traverse through around half the entire graph to collect neighbors.
>  * As far as I know, there hasn't been academic research or any testing into 
> how well this performs in terms of recall. I have a vague intuition it could 
> be harder to achieve high recall as the algorithm traverses areas further 
> from the "natural" entry points. The HNSW paper doesn't mention deletions/ 
> filtering, and I haven't seen community benchmarks around it.
> Background links:
>  * Thoughts on deletions from the author of the HNSW paper: 
> [https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892]
>  * Blog from Vespa team which mentions combining KNN and search filters (very 
> similar to applying deleted docs): 
> [https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/]. 
> The "Exact vs Approximate" section shows good performance even when a large 
> percentage of documents are filtered out. The team mentioned to me they 
> didn't have the chance to measure recall, only latency.
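The difference between the two options can be shown on a brute-force stand-in (docs assumed pre-sorted by ascending distance; no real HNSW traversal, and all names are illustrative): post-filtering can return fewer than k hits, while skipping deleted docs during collection still fills k when enough live docs exist.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch contrasting post-filtering with skip-during-collection.
public class DeletedAwareKnn {
  public static List<Integer> collect(
      int numDocs, boolean[] deleted, int k, boolean skipDuringCollection) {
    List<Integer> hits = new ArrayList<>();
    for (int doc = 0; doc < numDocs && hits.size() < k; doc++) {
      if (skipDuringCollection && deleted[doc]) {
        continue; // traverse past the deleted doc, keep looking for live ones
      }
      hits.add(doc);
    }
    if (!skipDuringCollection) {
      hits.removeIf(doc -> deleted[doc]); // post-filter: may leave < k hits
    }
    return hits;
  }
}
```

In the worst case described above (many near deletions) the skip-during-collection variant keeps traversing, which is where the unpredictable cost comes from.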






[jira] [Updated] (LUCENE-10063) SimpleTextKnnVectorsReader.search needs an implementation

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10063:
--
Labels: vector-based-search  (was: )

> SimpleTextKnnVectorsReader.search needs an implementation
> -
>
> Key: LUCENE-10063
> URL: https://issues.apache.org/jira/browse/LUCENE-10063
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Blocker
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> SimpleText doesn't implement vector search today; it throws an 
> UnsupportedOperationException. We worked around this by disabling SimpleText 
> on tests that use vectors until now, but this isn't a good solution: 
> SimpleText should implement APIs correctly and only be disabled on tests that 
> expect a binary format or that are too slow with SimpleText.
> Let's implement this method via linear scan for now?
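A linear scan is only a few lines: score every vector against the query and keep the k best, which is exact rather than approximate. A standalone sketch of the shape, not SimpleText's actual code:

```java
import java.util.Arrays;
import java.util.Comparator;

// Hedged sketch of an exact linear-scan knn: score everything, sort, take k.
public class LinearScanKnn {
  public static Integer[] search(float[][] vectors, float[] query, int k) {
    Integer[] docs = new Integer[vectors.length];
    float[] scores = new float[vectors.length];
    for (int doc = 0; doc < vectors.length; doc++) {
      docs[doc] = doc;
      float dot = 0;
      for (int i = 0; i < query.length; i++) {
        dot += vectors[doc][i] * query[i];
      }
      scores[doc] = dot;
    }
    // exact: sort all docs by descending score, take the first k
    Arrays.sort(docs, Comparator.comparingDouble((Integer d) -> -scores[d]));
    return Arrays.copyOf(docs, Math.min(k, docs.length));
  }
}
```

The O(n log n) sort is fine for a test-only codec; a heap (as in the binary formats) would make it O(n log k).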






[jira] [Updated] (LUCENE-10142) use a better RNG for Hnsw vectors

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10142:
--
Labels: vector-based-search  (was: )

> use a better RNG for Hnsw vectors
> -
>
> Key: LUCENE-10142
> URL: https://issues.apache.org/jira/browse/LUCENE-10142
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.0
>
> Attachments: LUCENE-10142.patch
>
>
> When profiling indexing with vectors at 
> http://people.apache.org/~mikemccand/lucenebench/, I see a fair amount of 
> time spent in java.util.Random.
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> ...
> 7.30% 305461java.util.Random#nextInt()
> {noformat}
> We don't need its thread-safety guarantees (CAS loop etc). 
> We can use SplittableRandom as a drop-in replacement.






[jira] [Updated] (LUCENE-10130) HnswGraph could make use of a SparseFixedBitSet.getAndSet

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10130:
--
Labels: vector-based-search  (was: )

> HnswGraph could make use of a SparseFixedBitSet.getAndSet
> -
>
> Key: LUCENE-10130
> URL: https://issues.apache.org/jira/browse/LUCENE-10130
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.0
>
> Attachments: LUCENE-10130.patch, LUCENE-10130_round2.patch
>
>
> Currently HnswGraph uses SparseFixedBitSet "visited" to track where it has 
> already been. The logic currently looks like this:
> {code}
> if (visited.get(entryPoint) == false) {
>   visited.set(entryPoint);
>   ... logic ...
> }
> {code}
> If SparseFixedBitSet had a {{getAndSet}} (like FixedBitSet), the code could 
> be:
> {code}
> if (visited.getAndSet(entrypoint) == false) {
>   ... logic ...
> }
> {code}
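A minimal sketch of the getAndSet pattern on a flat long[] bitset. FixedBitSet's getAndSet is essentially this; a SparseFixedBitSet version would add the same shortcut on top of its two-level sparse structure (an assumption, since per this issue the method does not exist there yet):

```java
public class GetAndSetSketch {
    private final long[] bits;

    GetAndSetSketch(int numBits) {
        bits = new long[(numBits + 63) >>> 6];
    }

    // Reads the previous value of the bit and sets it in one pass over the
    // backing word, instead of locating the word twice for get() then set().
    boolean getAndSet(int index) {
        int word = index >>> 6;
        long mask = 1L << index; // Java long shifts use only the low 6 bits
        boolean previous = (bits[word] & mask) != 0;
        bits[word] |= mask;
        return previous;
    }

    public static void main(String[] args) {
        GetAndSetSketch visited = new GetAndSetSketch(128);
        // HNSW-style usage: enter the "logic" branch only on the first visit.
        System.out.println(visited.getAndSet(70)); // false: first visit
        System.out.println(visited.getAndSet(70)); // true: already visited
    }
}
```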






[jira] [Updated] (LUCENE-10146) Add VectorSimilarityFunction.COSINE

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10146:
--
Labels: vector-based-search  (was: )

> Add VectorSimilarityFunction.COSINE
> ---
>
> Key: LUCENE-10146
> URL: https://issues.apache.org/jira/browse/LUCENE-10146
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> To perform ANN search with cosine similarity, users are expected to normalize 
> the document and query vectors to unit length, then use 
> {{VectorSimilarityFunction.DOT_PRODUCT}}. I think it would be good to also 
> support cosine similarity directly through 
> {{VectorSimilarityFunction.COSINE}}. This would allow users to perform ANN 
> based on cosine similarity, while retaining access to the original vectors 
> through {{VectorValues}}. That way they can use the original vectors in a 
> reranking step or return them to the application for further processing.
> It looks like nmslib and hnswlib support cosine similarity. On the other 
> hand, FAISS only supports dot product and suggests users normalize the 
> vectors to perform cosine similarity 
> (https://github.com/facebookresearch/faiss/issues/95). To me adding this one 
> additional similarity is worth it in terms of what it lets users accomplish.
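A small sketch (hypothetical class and method names) of why the two setups score identically but differ in what they preserve:

```java
public class CosineSketch {
    static float dot(float[] a, float[] b) {
        float sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    static float cosine(float[] a, float[] b) {
        return (float) (dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b))));
    }

    static float[] normalize(float[] v) {
        float norm = (float) Math.sqrt(dot(v, v));
        float[] unit = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            unit[i] = v[i] / norm;
        }
        return unit;
    }

    public static void main(String[] args) {
        float[] doc = {3, 4};
        float[] query = {1, 0};

        // Cosine on the raw vectors equals the dot product of their
        // unit-length copies, so the two setups rank documents identically...
        System.out.println(cosine(doc, query));                    // 0.6
        System.out.println(dot(normalize(doc), normalize(query))); // 0.6

        // ...but pre-normalizing destroys the stored magnitudes: the original
        // {3, 4} is gone from the index, which is what a dedicated COSINE
        // similarity would avoid.
        System.out.println(normalize(doc)[0]);                     // 0.6
    }
}
```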






[jira] [Updated] (LUCENE-10178) Add toString for inspecting Lucene90HnswVectorsFormat

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10178:
--
Labels: vector-based-search  (was: )

> Add toString for inspecting Lucene90HnswVectorsFormat
> -
>
> Key: LUCENE-10178
> URL: https://issues.apache.org/jira/browse/LUCENE-10178
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Priority: Trivial
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Since `Lucene90HnswVectorsFormat` has a number of parameters, it is useful 
> for testing and debugging to add a 
> `toString()` method that will output `maxConn` and `beamWidth`.






[jira] [Updated] (LUCENE-9004) Approximate nearest vector search

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-9004:
-
Labels: vector-based-search  (was: )

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Assignee: Michael Sokolov
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.0
>
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to lookup through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However 
> graph-per-segment is very natural at search time - we can traverse each 
> segments' graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: In order to add links bidirectionally we must continually update existing 
> documents.
> I think we want to express this API to users as a single joint 
> {{KnnGraphField}} abstraction that joins together the vectors and the graph 
> as a single joint field type. Mostly it just looks like a vector-valued 
> field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many 
> nocommits, basic design is not really set, there is no Query implementation 
> and no integration with IndexSearcher, but it does work by some measure using 
> a standalone test class. I've tested with uniform random vectors and on my 
> laptop indexed 10K documents in around 10 seconds and searched them at 95% 
> recall (compared with exact nearest-neighbor baseline) at around 250 QPS. I 
> haven't made any attempt to use multithreaded search for this, but it is 
> amenable to per-segment concurrency.
> [1] 
> 
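The graph-traversal idea described above can be illustrated with a toy greedy descent over a tiny hand-built graph (one-dimensional "vectors" for readability; real HNSW keeps a beam of candidates and multiple layers, so this is only the single-candidate core of the algorithm):

```java
public class GreedySearchSketch {
    // Greedy descent: from an entry point, repeatedly hop to whichever
    // neighbor is closer to the query, stopping when no neighbor improves.
    // On a navigable small-world graph this reaches an approximate nearest
    // neighbor in roughly log N hops.
    static int greedySearch(double[] vectors, int[][] neighbors, int entry, double query) {
        int current = entry;
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int n : neighbors[current]) {
                if (Math.abs(vectors[n] - query) < Math.abs(vectors[current] - query)) {
                    current = n;
                    improved = true;
                }
            }
        }
        return current;
    }

    public static void main(String[] args) {
        // Hand-built toy graph: node i stores vectors[i] and links to neighbors[i].
        double[] vectors = {0.0, 1.0, 2.0, 5.0, 9.0};
        int[][] neighbors = {{1}, {0, 2}, {1, 3}, {2, 4}, {3}};
        // Node 3 (value 5.0) is the true nearest neighbor of 4.2.
        System.out.println(greedySearch(vectors, neighbors, 0, 4.2));
    }
}
```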

[jira] [Updated] (LUCENE-10228) PerFieldKnnVectorsFormat can write to wrong format name

2022-04-20 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10228:
--
Labels: vector-based-search  (was: )

> PerFieldKnnVectorsFormat can write to wrong format name
> ---
>
> Key: LUCENE-10228
> URL: https://issues.apache.org/jira/browse/LUCENE-10228
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Julie Tibshirani
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.0, 9.1
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently when creating a KnnVectorsWriter for merging, we consult the 
> existing "PER_FIELD_SUFFIX_KEY" attribute to determine the format's per-field 
> suffix. This isn't correct since we could be using a new codec (that produces 
> different formats/ suffixes).
> The attached PR modifies TestPerFieldDocValuesFormat#testMergeUsesNewFormat 
> to trigger the problem. Without the fix we get an error like 
> "java.nio.file.FileAlreadyExistsException: File 
> "_3_Lucene90HnswVectorsFormat_0.vem" was already written to."
>  






[jira] [Commented] (LUCENE-8216) Better cross-field scoring

2021-06-04 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17357270#comment-17357270
 ] 

Alessandro Benedetti commented on LUCENE-8216:
--

[~jim.ferenczi] thank you very much for your prompt response; it makes sense!



> Better cross-field scoring
> --
>
> Key: LUCENE-8216
> URL: https://issues.apache.org/jira/browse/LUCENE-8216
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Jim Ferenczi
>Priority: Major
> Fix For: 8.0
>
> Attachments: LUCENE-8216.patch, LUCENE-8216.patch
>
>
> I'd like Lucene to have better support for scoring across multiple fields. 
> Today we have BlendedTermQuery which tries to help there but it probably 
> tries to do too much on some aspects (handling cross-field term queries AND 
> synonyms) and too little on other ones (it tries to merge index-level 
> statistics, but not per-document statistics like tf and norm).
> Maybe we could implement something like BM25F so that queries across multiple 
> fields would retain the benefits of BM25 like the fact that the impact of the 
> term frequency saturates quickly, which is not the case with BlendedTermQuery 
> if you have occurrences across many fields.






[jira] [Commented] (LUCENE-8216) Better cross-field scoring

2021-06-04 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17357258#comment-17357258
 ] 

Alessandro Benedetti commented on LUCENE-8216:
--

Hi [~jim.ferenczi], I am investigating BM25F in Lucene and Solr and I ended up 
here: org/apache/lucene/sandbox/search/CombinedFieldQuery.java:289
When calculating the IDF in BM25F we do that across fields, so as far as I have 
explored the matter, the document frequency for a term T should be:

the number of documents in the corpus that contain the term T (in any field).

Effectively, it is the cardinality of the union of the posting lists for that 
term across the various fields.

From a quick look at your code, the document frequency is just calculated as 
the max document frequency across all the fields involved (which is actually a 
lower bound of the real blended document frequency).
Was this approximation chosen for simplicity, or is there another reason?
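A toy sketch (hypothetical posting lists; not the actual CombinedFieldQuery code) contrasting the exact union-based document frequency with the max-based approximation:

```java
import java.util.HashSet;
import java.util.Set;

public class BlendedDocFreqSketch {
    // Exact blended df: cardinality of the union of per-field posting lists,
    // i.e. documents containing the term in ANY field.
    static int exactDf(int[][] perFieldPostings) {
        Set<Integer> union = new HashSet<>();
        for (int[] postings : perFieldPostings) {
            for (int doc : postings) {
                union.add(doc);
            }
        }
        return union.size();
    }

    // The approximation discussed above: the max per-field df.
    static int approxDf(int[][] perFieldPostings) {
        int max = 0;
        for (int[] postings : perFieldPostings) {
            max = Math.max(max, postings.length);
        }
        return max;
    }

    public static void main(String[] args) {
        // Hypothetical doc ids containing term T in a title and a body field.
        int[][] postings = {{1, 2, 5}, {2, 3, 5, 7}};
        System.out.println(exactDf(postings));  // 5: docs {1, 2, 3, 5, 7}
        System.out.println(approxDf(postings)); // 4: the body field's df
    }
}
```

Since every per-field posting list is a subset of the union, the max is always a lower bound on the exact blended document frequency.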

> Better cross-field scoring
> --
>
> Key: LUCENE-8216
> URL: https://issues.apache.org/jira/browse/LUCENE-8216
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Jim Ferenczi
>Priority: Major
> Fix For: 8.0
>
> Attachments: LUCENE-8216.patch, LUCENE-8216.patch
>
>
> I'd like Lucene to have better support for scoring across multiple fields. 
> Today we have BlendedTermQuery which tries to help there but it probably 
> tries to do too much on some aspects (handling cross-field term queries AND 
> synonyms) and too little on other ones (it tries to merge index-level 
> statistics, but not per-document statistics like tf and norm).
> Maybe we could implement something like BM25F so that queries across multiple 
> fields would retain the benefits of BM25 like the fact that the impact of the 
> term frequency saturates quickly, which is not the case with BlendedTermQuery 
> if you have occurrences across many fields.






[jira] [Resolved] (SOLR-15149) Learning To Rank model upload fails generically

2021-02-18 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti resolved SOLR-15149.
-
Resolution: Fixed

> Learning To Rank model upload fails generically
> ---
>
> Key: SOLR-15149
> URL: https://issues.apache.org/jira/browse/SOLR-15149
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - LTR
>Reporter: Alessandro Benedetti
>Assignee: Alessandro Benedetti
>Priority: Major
> Fix For: 8.9
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When uploading a model using a non-existent store or other incorrect 
> parameters, you get:
> "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "root-error-class","java.lang.ClassCastException"],
> "msg":"org.apache.solr.ltr.model.ModelException: Model type does not 
> exist org.apache.solr.ltr.model.LinearModel",
> "code":400}}
> In the response, the logs don't help that much out of the box; I had to go 
> for remote debugging, and of course we don't want a regular user to do that.
> Reason is in org/apache/solr/ltr/model/LTRScoringModel.java:111
> {code:java}
> try {
>   // create an instance of the model
>   model = solrResourceLoader.newInstance(
>   className,
>   LTRScoringModel.class,
>   new String[0], // no sub packages
>   new Class[] { String.class, List.class, List.class, String.class, 
> List.class, Map.class },
>   new Object[] { name, features, norms, featureStoreName, 
> allFeatures, params });
>   if (params != null) {
> SolrPluginUtils.invokeSetters(model, params.entrySet());
>   }
> } catch (final Exception e) {
>   throw new ModelException("Model type does not exist " + className, e);
> }
> {code}
> This happens when you:
> - use a non-existent feature store
> - use a non-existent feature
> - use an integer instead of a Double as a weight in a linear model
> Unless there are any objections, we should improve this message to surface 
> the real one






[jira] [Commented] (SOLR-15149) Learning To Rank model upload fails generically

2021-02-16 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17285514#comment-17285514
 ] 

Alessandro Benedetti commented on SOLR-15149:
-

Thanks for all the good information [~hossman], you are quite right.
I ran my tests, but after merging some additional code I didn't re-run them 
before the final merge, relying on the pull request checks, which ended up 
being just:


{noformat}
All checks have passed
2 successful checks
@github-actions
Gradle Precommit / gradle precommit w/ Java 11 (pull_request) Successful in 21m
Details
@muse-dev
musedev — Complete (29 min, 10/11 checks) no new bugs found
Details
This branch has no conflicts with the base branch
Merging can be performed automatically.
{noformat}

This won't happen again!

> Learning To Rank model upload fails generically
> ---
>
> Key: SOLR-15149
> URL: https://issues.apache.org/jira/browse/SOLR-15149
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - LTR
>Reporter: Alessandro Benedetti
>Assignee: Alessandro Benedetti
>Priority: Major
> Fix For: 8.9
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When uploading a model using a non-existent store or other incorrect 
> parameters, you get:
> "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "root-error-class","java.lang.ClassCastException"],
> "msg":"org.apache.solr.ltr.model.ModelException: Model type does not 
> exist org.apache.solr.ltr.model.LinearModel",
> "code":400}}
> In the response, the logs don't help that much out of the box; I had to go 
> for remote debugging, and of course we don't want a regular user to do that.
> Reason is in org/apache/solr/ltr/model/LTRScoringModel.java:111
> {code:java}
> try {
>   // create an instance of the model
>   model = solrResourceLoader.newInstance(
>   className,
>   LTRScoringModel.class,
>   new String[0], // no sub packages
>   new Class[] { String.class, List.class, List.class, String.class, 
> List.class, Map.class },
>   new Object[] { name, features, norms, featureStoreName, 
> allFeatures, params });
>   if (params != null) {
> SolrPluginUtils.invokeSetters(model, params.entrySet());
>   }
> } catch (final Exception e) {
>   throw new ModelException("Model type does not exist " + className, e);
> }
> {code}
> This happens when you:
> - use a non-existent feature store
> - use a non-existent feature
> - use an integer instead of a Double as a weight in a linear model
> Unless there are any objections, we should improve this message to surface 
> the real one






[jira] [Commented] (SOLR-15149) Learning To Rank model upload fails generically

2021-02-16 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17285408#comment-17285408
 ] 

Alessandro Benedetti commented on SOLR-15149:
-

Fixed; I'll monitor it this evening.

[~hossman] thanks for letting me know. I was navigating the Jenkins jungle; 
could you point me to the report page where I can see the branch (master and 
8.x) test failures?

I was looking at :
https://ci-builds.apache.org/job/Lucene/job/Solr-Artifacts-master/
or
https://ci-builds.apache.org/job/Lucene/job/Lucene-Solr-SmokeRelease-8.x/118/

But neither offers a quick summary of the failed tests; should we just take a 
look at the console output section?

> Learning To Rank model upload fails generically
> ---
>
> Key: SOLR-15149
> URL: https://issues.apache.org/jira/browse/SOLR-15149
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - LTR
>Reporter: Alessandro Benedetti
>Assignee: Alessandro Benedetti
>Priority: Major
> Fix For: 8.9
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When uploading a model using a non-existent store or other incorrect 
> parameters, you get:
> "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "root-error-class","java.lang.ClassCastException"],
> "msg":"org.apache.solr.ltr.model.ModelException: Model type does not 
> exist org.apache.solr.ltr.model.LinearModel",
> "code":400}}
> In the response, the logs don't help that much out of the box; I had to go 
> for remote debugging, and of course we don't want a regular user to do that.
> Reason is in org/apache/solr/ltr/model/LTRScoringModel.java:111
> {code:java}
> try {
>   // create an instance of the model
>   model = solrResourceLoader.newInstance(
>   className,
>   LTRScoringModel.class,
>   new String[0], // no sub packages
>   new Class[] { String.class, List.class, List.class, String.class, 
> List.class, Map.class },
>   new Object[] { name, features, norms, featureStoreName, 
> allFeatures, params });
>   if (params != null) {
> SolrPluginUtils.invokeSetters(model, params.entrySet());
>   }
> } catch (final Exception e) {
>   throw new ModelException("Model type does not exist " + className, e);
> }
> {code}
> This happens when you:
> - use a non-existent feature store
> - use a non-existent feature
> - use an integer instead of a Double as a weight in a linear model
> Unless there are any objections, we should improve this message to surface 
> the real one






[jira] [Commented] (SOLR-15149) Learning To Rank model upload fails generically

2021-02-16 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17285402#comment-17285402
 ] 

Alessandro Benedetti commented on SOLR-15149:
-

My bad: when I merged, I didn't re-run the tests locally (as I noticed the 
build associated with the PR was green).
I assume a green build on the GitHub PR doesn't mean the tests ran; I am 
fixing it now.

> Learning To Rank model upload fails generically
> ---
>
> Key: SOLR-15149
> URL: https://issues.apache.org/jira/browse/SOLR-15149
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - LTR
>Reporter: Alessandro Benedetti
>Assignee: Alessandro Benedetti
>Priority: Major
> Fix For: 8.9
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When uploading a model using a non-existent store or other incorrect 
> parameters, you get:
> "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "root-error-class","java.lang.ClassCastException"],
> "msg":"org.apache.solr.ltr.model.ModelException: Model type does not 
> exist org.apache.solr.ltr.model.LinearModel",
> "code":400}}
> In the response, the logs don't help that much out of the box; I had to go 
> for remote debugging, and of course we don't want a regular user to do that.
> Reason is in org/apache/solr/ltr/model/LTRScoringModel.java:111
> {code:java}
> try {
>   // create an instance of the model
>   model = solrResourceLoader.newInstance(
>   className,
>   LTRScoringModel.class,
>   new String[0], // no sub packages
>   new Class[] { String.class, List.class, List.class, String.class, 
> List.class, Map.class },
>   new Object[] { name, features, norms, featureStoreName, 
> allFeatures, params });
>   if (params != null) {
> SolrPluginUtils.invokeSetters(model, params.entrySet());
>   }
> } catch (final Exception e) {
>   throw new ModelException("Model type does not exist " + className, e);
> }
> {code}
> This happens when you:
> - use a non-existent feature store
> - use a non-existent feature
> - use an integer instead of a Double as a weight in a linear model
> Unless there are any objections, we should improve this message to surface 
> the real one






[jira] [Commented] (SOLR-12676) Improve details on ModelException when the feature of a model has not been defined

2021-02-15 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284869#comment-17284869
 ] 

Alessandro Benedetti commented on SOLR-12676:
-

This issue has been resolved in SOLR-15149; please double-check and resolve 
this one.
8.9.0 is the expected fix version.

> Improve details on ModelException when the feature of a model has not been 
> defined
> --
>
> Key: SOLR-12676
> URL: https://issues.apache.org/jira/browse/SOLR-12676
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - LTR
>Reporter: Steven Spasbo
>Priority: Major
>
> While trying to create a model definition, I was getting back the response:
> {code}
> {
>   "responseHeader":{
>   [...]
>   "error":{
> [...]
> "msg":"org.apache.solr.ltr.model.ModelException: Model type does not 
> exist org.apache.solr.ltr.model.LinearModel",
> "code":400}
> }
> }
> {code}
> I initially thought this was related to the library, but after a while 
> figured out that I had forgotten to create a feature in my feature store. 
> After creating that the model was created as expected. 
> To recreate this:
> {code}
> curl -XPOST -H 'Content-Type: application/json' \
> -d '{
>   "store" : "myStore",
>   "name" : "myModel",
>   "class" : "org.apache.solr.ltr.model.LinearModel",
>   "features" : [{
> "name" : "nonExistentFeature"
>   }],
>   "params" : {
> "nonExistentFeature" : 1.0
>   }
> }' http://localhost:8983/solr/$CORE/schema/model-store
> {code}






[jira] [Commented] (SOLR-12367) When adding a model referencing a non-existent feature the error message is very ambiguous

2021-02-15 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284868#comment-17284868
 ] 

Alessandro Benedetti commented on SOLR-12367:
-

This issue has been resolved in SOLR-15149; please double-check and resolve 
this one.
8.9.0 is the expected fix version.

> When adding a model referencing a non-existent feature the error message is 
> very ambiguous
> --
>
> Key: SOLR-12367
> URL: https://issues.apache.org/jira/browse/SOLR-12367
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - LTR
>Affects Versions: 7.3.1
>Reporter: Georg Sorst
>Priority: Minor
> Attachments: SOLR-12367.patch, SOLR-12367.patch, SOLR-12367.patch
>
>
> When adding a model that references a non-existent feature a very ambiguous 
> error message is thrown, something like "Model type does not exist 
> org.apache.solr.ltr.model.LinearModel".
>  
> To reproduce, do not add any features and just add a model, for example by 
> doing this:
>  
> {code}
> curl -XPUT 'http://localhost:8983/solr/gettingstarted/schema/model-store' \
> --data-binary '{
>   "class": "org.apache.solr.ltr.model.LinearModel",
>   "name": "myModel",
>   "features": [ {"name": "whatever"} ],
>   "params": {"weights": {"whatever": 1.0}}
> }' -H 'Content-type:application/json'
> {code}
>  
> The resulting error message "Model type does not exist 
> org.apache.solr.ltr.model.LinearModel" is extremely misleading and it took 
> me a while to figure out the actual cause.
>  
> A more suitable error message should probably indicate the name of the 
> missing feature that the model is trying to reference.






[jira] [Commented] (SOLR-11137) LTR: misleading error message when loading a model

2021-02-15 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284863#comment-17284863
 ] 

Alessandro Benedetti commented on SOLR-11137:
-

[~diegoceccarelli] this issue has been addressed in SOLR-15149 (see the linked 
issues). When you have time, can you double-check that the fix covers all the 
cases you experienced?
Feel free to resolve the ticket then.

Cheers

> LTR: misleading error message when loading a model
> --
>
> Key: SOLR-11137
> URL: https://issues.apache.org/jira/browse/SOLR-11137
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - LTR
>Reporter: Diego Ceccarelli
>Priority: Minor
>
> Loading a model can fail for several reasons when calling the model 
> constructor, but the error message always reports that the Model type does 
> not exist.
> https://github.com/apache/lucene-solr/blob/master/solr/contrib/ltr/src/java/org/apache/solr/ltr/model/LTRScoringModel.java#L103






[jira] [Commented] (SOLR-15149) Learning To Rank model upload fails generically

2021-02-15 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284850#comment-17284850
 ] 

Alessandro Benedetti commented on SOLR-15149:
-

The contribution has been committed; closing this issue and the related ones.

> Learning To Rank model upload fails generically
> ---
>
> Key: SOLR-15149
> URL: https://issues.apache.org/jira/browse/SOLR-15149
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - LTR
>Reporter: Alessandro Benedetti
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When uploading a model using a non-existent store or other incorrect 
> parameters, you get:
> "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "root-error-class","java.lang.ClassCastException"],
> "msg":"org.apache.solr.ltr.model.ModelException: Model type does not 
> exist org.apache.solr.ltr.model.LinearModel",
> "code":400}}
> In the response, the logs don't help that much out of the box; I had to go 
> for remote debugging, and of course we don't want a regular user to do that.
> Reason is in org/apache/solr/ltr/model/LTRScoringModel.java:111
> {code:java}
> try {
>   // create an instance of the model
>   model = solrResourceLoader.newInstance(
>   className,
>   LTRScoringModel.class,
>   new String[0], // no sub packages
>   new Class[] { String.class, List.class, List.class, String.class, 
> List.class, Map.class },
>   new Object[] { name, features, norms, featureStoreName, 
> allFeatures, params });
>   if (params != null) {
> SolrPluginUtils.invokeSetters(model, params.entrySet());
>   }
> } catch (final Exception e) {
>   throw new ModelException("Model type does not exist " + className, e);
> }
> {code}
> This happens when you:
> - use a non-existent feature store
> - use a non-existent feature
> - use an integer instead of a Double as a weight in a linear model
> Unless there are any objections, we should improve this message to surface 
> the real one






[jira] [Resolved] (SOLR-15149) Learning To Rank model upload fails generically

2021-02-15 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti resolved SOLR-15149.
-
Fix Version/s: 8.9
   Resolution: Fixed

> Learning To Rank model upload fails generically
> ---
>
> Key: SOLR-15149
> URL: https://issues.apache.org/jira/browse/SOLR-15149
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - LTR
>Reporter: Alessandro Benedetti
>Priority: Major
> Fix For: 8.9
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When uploading a model, using a not existent store or other incorrect 
> parameters you get:
> "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "root-error-class","java.lang.ClassCastException"],
> "msg":"org.apache.solr.ltr.model.ModelException: Model type does not 
> exist org.apache.solr.ltr.model.LinearModel",
> "code":400}}
> In the response, logs don't help that much out of the box, I had to go for 
> remote debugging and of course we don't want the generic user to do that.
> Reason is in org/apache/solr/ltr/model/LTRScoringModel.java:111
> {code:java}
> try {
>   // create an instance of the model
>   model = solrResourceLoader.newInstance(
>   className,
>   LTRScoringModel.class,
>   new String[0], // no sub packages
>   new Class[] { String.class, List.class, List.class, String.class, 
> List.class, Map.class },
>   new Object[] { name, features, norms, featureStoreName, 
> allFeatures, params });
>   if (params != null) {
> SolrPluginUtils.invokeSetters(model, params.entrySet());
>   }
> } catch (final Exception e) {
>   throw new ModelException("Model type does not exist " + className, e);
> }
> {code}
> This happens when:
> - use a not existent feature store
> - use not existent feature
> - use an integer instead of Double as a weight in a linear model
> unless any objection, we should improve such message with the real one






[jira] [Commented] (SOLR-15149) Learning To Rank model upload fails generically

2021-02-11 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282913#comment-17282913
 ] 

Alessandro Benedetti commented on SOLR-15149:
-

Sorry [~dweiss]! You had just reverted a commit on that class, which is why you 
were in the list of committers :)
You can ignore my tag then!

> Learning To Rank model upload fails generically
> ---
>
> Key: SOLR-15149
> URL: https://issues.apache.org/jira/browse/SOLR-15149
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alessandro Benedetti
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When uploading a model, using a not existent store or other incorrect 
> parameters you get:
> "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "root-error-class","java.lang.ClassCastException"],
> "msg":"org.apache.solr.ltr.model.ModelException: Model type does not 
> exist org.apache.solr.ltr.model.LinearModel",
> "code":400}}
> In the response, logs don't help that much out of the box, I had to go for 
> remote debugging and of course we don't want the generic user to do that.
> Reason is in org/apache/solr/ltr/model/LTRScoringModel.java:111
> {code:java}
> try {
>   // create an instance of the model
>   model = solrResourceLoader.newInstance(
>   className,
>   LTRScoringModel.class,
>   new String[0], // no sub packages
>   new Class[] { String.class, List.class, List.class, String.class, 
> List.class, Map.class },
>   new Object[] { name, features, norms, featureStoreName, 
> allFeatures, params });
>   if (params != null) {
> SolrPluginUtils.invokeSetters(model, params.entrySet());
>   }
> } catch (final Exception e) {
>   throw new ModelException("Model type does not exist " + className, e);
> }
> {code}
> This happens when:
> - use a not existent feature store
> - use not existent feature
> - use an integer instead of Double as a weight in a linear model
> unless any objection, we should improve such message with the real one






[jira] [Commented] (SOLR-15149) Learning To Rank model upload fails generically

2021-02-10 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282744#comment-17282744
 ] 

Alessandro Benedetti commented on SOLR-15149:
-

Unless there are any objections from [~erickerickson], [~dawid.weiss], or 
[~cpoerschke], I'll proceed with the merge in the next couple of days.

> Learning To Rank model upload fails generically
> ---
>
> Key: SOLR-15149
> URL: https://issues.apache.org/jira/browse/SOLR-15149
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alessandro Benedetti
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When uploading a model, using a not existent store or other incorrect 
> parameters you get:
> "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "root-error-class","java.lang.ClassCastException"],
> "msg":"org.apache.solr.ltr.model.ModelException: Model type does not 
> exist org.apache.solr.ltr.model.LinearModel",
> "code":400}}
> In the response, logs don't help that much out of the box, I had to go for 
> remote debugging and of course we don't want the generic user to do that.
> Reason is in org/apache/solr/ltr/model/LTRScoringModel.java:111
> {code:java}
> try {
>   // create an instance of the model
>   model = solrResourceLoader.newInstance(
>   className,
>   LTRScoringModel.class,
>   new String[0], // no sub packages
>   new Class[] { String.class, List.class, List.class, String.class, 
> List.class, Map.class },
>   new Object[] { name, features, norms, featureStoreName, 
> allFeatures, params });
>   if (params != null) {
> SolrPluginUtils.invokeSetters(model, params.entrySet());
>   }
> } catch (final Exception e) {
>   throw new ModelException("Model type does not exist " + className, e);
> }
> {code}
> This happens when:
> - use a not existent feature store
> - use not existent feature
> - use an integer instead of Double as a weight in a linear model
> unless any objection, we should improve such message with the real one






[jira] [Updated] (SOLR-15149) Learning To Rank model upload fails generically

2021-02-10 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated SOLR-15149:

Description: 
When uploading a model using a non-existent store or other incorrect 
parameters, you get:

"error":{
"metadata":[
  "error-class","org.apache.solr.common.SolrException",
  "root-error-class","java.lang.ClassCastException"],
"msg":"org.apache.solr.ltr.model.ModelException: Model type does not exist 
org.apache.solr.ltr.model.LinearModel",
"code":400}}

Neither the response nor the logs help much out of the box; I had to resort to 
remote debugging, and of course we don't want the average user to do that.

The reason is in org/apache/solr/ltr/model/LTRScoringModel.java:111


{code:java}
try {
  // create an instance of the model
  model = solrResourceLoader.newInstance(
  className,
  LTRScoringModel.class,
  new String[0], // no sub packages
  new Class[] { String.class, List.class, List.class, String.class, 
List.class, Map.class },
  new Object[] { name, features, norms, featureStoreName, allFeatures, 
params });
  if (params != null) {
SolrPluginUtils.invokeSetters(model, params.entrySet());
  }
} catch (final Exception e) {
  throw new ModelException("Model type does not exist " + className, e);
}
{code}

This happens when you:
- use a non-existent feature store
- use a non-existent feature
- use an integer instead of a Double as a weight in a linear model

Unless there are any objections, we should replace this generic message with 
the real cause.

  was:
When uploading a model, using a not existent store or other incorrect 
parameters you get:

"error":{
"metadata":[
  "error-class","org.apache.solr.common.SolrException",
  "root-error-class","java.lang.ClassCastException"],
"msg":"org.apache.solr.ltr.model.ModelException: Model type does not exist 
org.apache.solr.ltr.model.LinearModel",
"code":400}}

In the response, logs don't help that much out of the box, I had to go for 
remote debugging and of course we don't want the generic user to do that.

Reason is in org/apache/solr/ltr/model/LTRScoringModel.java:111


{code:java}
try {
  // create an instance of the model
  model = solrResourceLoader.newInstance(
  className,
  LTRScoringModel.class,
  new String[0], // no sub packages
  new Class[] { String.class, List.class, List.class, String.class, 
List.class, Map.class },
  new Object[] { name, features, norms, featureStoreName, allFeatures, 
params });
  if (params != null) {
SolrPluginUtils.invokeSetters(model, params.entrySet());
  }
} catch (final Exception e) {
  throw new ModelException("Model type does not exist " + className, e);
}
{code}

unless any objection, we should improve such message with the real one


> Learning To Rank model upload fails generically
> ---
>
> Key: SOLR-15149
> URL: https://issues.apache.org/jira/browse/SOLR-15149
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alessandro Benedetti
>Priority: Major
>
> When uploading a model, using a not existent store or other incorrect 
> parameters you get:
> "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "root-error-class","java.lang.ClassCastException"],
> "msg":"org.apache.solr.ltr.model.ModelException: Model type does not 
> exist org.apache.solr.ltr.model.LinearModel",
> "code":400}}
> In the response, logs don't help that much out of the box, I had to go for 
> remote debugging and of course we don't want the generic user to do that.
> Reason is in org/apache/solr/ltr/model/LTRScoringModel.java:111
> {code:java}
> try {
>   // create an instance of the model
>   model = solrResourceLoader.newInstance(
>   className,
>   LTRScoringModel.class,
>   new String[0], // no sub packages
>   new Class[] { String.class, List.class, List.class, String.class, 
> List.class, Map.class },
>   new Object[] { name, features, norms, featureStoreName, 
> allFeatures, params });
>   if (params != null) {
> SolrPluginUtils.invokeSetters(model, params.entrySet());
>   }
> } catch (final Exception e) {
>   throw new ModelException("Model type does not exist " + className, e);
> }
> {code}
> This happens when:
> - use a not existent feature store
> - use not existent feature
> - use an integer instead of Double as a weight in a linear model
> unless any objection, we should improve such message with the real one




[jira] [Created] (SOLR-15149) Learning To Rank model upload fails generically

2021-02-10 Thread Alessandro Benedetti (Jira)
Alessandro Benedetti created SOLR-15149:
---

 Summary: Learning To Rank model upload fails generically
 Key: SOLR-15149
 URL: https://issues.apache.org/jira/browse/SOLR-15149
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Alessandro Benedetti


When uploading a model using a non-existent store or other incorrect 
parameters, you get:

"error":{
"metadata":[
  "error-class","org.apache.solr.common.SolrException",
  "root-error-class","java.lang.ClassCastException"],
"msg":"org.apache.solr.ltr.model.ModelException: Model type does not exist 
org.apache.solr.ltr.model.LinearModel",
"code":400}}

Neither the response nor the logs help much out of the box; I had to resort to 
remote debugging, and of course we don't want the average user to do that.

The reason is in org/apache/solr/ltr/model/LTRScoringModel.java:111


{code:java}
try {
  // create an instance of the model
  model = solrResourceLoader.newInstance(
  className,
  LTRScoringModel.class,
  new String[0], // no sub packages
  new Class[] { String.class, List.class, List.class, String.class, 
List.class, Map.class },
  new Object[] { name, features, norms, featureStoreName, allFeatures, 
params });
  if (params != null) {
SolrPluginUtils.invokeSetters(model, params.entrySet());
  }
} catch (final Exception e) {
  throw new ModelException("Model type does not exist " + className, e);
}
{code}

Unless there are any objections, we should replace this generic message with 
the real cause.
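A minimal sketch of the kind of improvement proposed (hypothetical names, not the actual patch): when model creation fails, walk the cause chain and report the root cause rather than the generic "Model type does not exist" message:

```java
// Hypothetical sketch (not the committed patch): unwrap the cause chain so
// the user sees the real failure instead of "Model type does not exist".
public class ModelExceptionDemo {

  static class ModelException extends RuntimeException {
    ModelException(String message, Exception cause) {
      super(message, cause);
    }
  }

  // Walk the cause chain to find the root cause, e.g. the
  // ClassCastException raised when an integer weight is used.
  static ModelException wrap(String className, Exception e) {
    Throwable root = e;
    while (root.getCause() != null) {
      root = root.getCause();
    }
    return new ModelException(
        "Model loading failed for " + className + ": " + root, e);
  }

  public static void main(String[] args) {
    Exception cause = new ClassCastException(
        "java.lang.Integer cannot be cast to java.lang.Double");
    System.out.println(
        wrap("org.apache.solr.ltr.model.LinearModel", cause).getMessage());
  }
}
```

With this shape, the ClassCastException caused by an integer weight surfaces directly in the 400 response instead of being hidden behind the generic message.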






[jira] [Updated] (SOLR-15112) SolrJ DocumentObjectBinder.toSolrInputDocument null handling

2021-01-27 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated SOLR-15112:

Description: 
Currently, the
org.apache.solr.client.solrj.beans.DocumentObjectBinder#toSolrInputDocument
method doesn't handle nulls in Java objects very well.

Even if a field is null in the Java object, the binder adds the field (with 
the null value) to the SolrInputDocument.

This may cause issues down the line, for example with UpdateRequestProcessors 
such as UUIDUpdateProcessorFactory (which doesn't check the value of the 
field, only whether the field is present).

The proposal here is to make the binder NOT add null fields to the 
SolrInputDocument.

Any objections are welcome (I took this list of committers from some of the 
latest contributors to the class):
[~noble] [~noble.paul] [~erick] [~erickerickson] [~jpountz]

  was:
Currently the:
org.apache.solr.client.solrj.beans.DocumentObjectBinder#toSolrInputDocument
method doesn't handle nulls in java objects very well.

Even if the field is null in the Java Object, the binder adds the field(with 
the null value) to the SolrInputDocument.

This may cause issues down the line, for example using UpdateRequestProcessors 
such as the UUIDUpdateProcessorFactory (which doesn't check the value of the 
field, but it just checks if a field is present)

The proposal here is to make the binder NOT add null fields to the 
SolrInputDocument.

Any objection is welcome


> SolrJ DocumentObjectBinder.toSolrInputDocument null handling
> 
>
> Key: SOLR-15112
> URL: https://issues.apache.org/jira/browse/SOLR-15112
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrJ
>Affects Versions: 8.7
>Reporter: Alessandro Benedetti
>Priority: Minor
> Attachments: Screenshot 2021-01-27 at 16.07.13.png
>
>
> Currently the:
> org.apache.solr.client.solrj.beans.DocumentObjectBinder#toSolrInputDocument
> method doesn't handle nulls in java objects very well.
> Even if the field is null in the Java Object, the binder adds the field(with 
> the null value) to the SolrInputDocument.
> This may cause issues down the line, for example using 
> UpdateRequestProcessors such as the UUIDUpdateProcessorFactory (which doesn't 
> check the value of the field, but it just checks if a field is present)
> The proposal here is to make the binder NOT add null fields to the 
> SolrInputDocument.
> Any objection is welcome (took this list of committers from some of the 
> latest contributors to the class):
> [~noble] [~noble.paul] [~erick][~erickerickson][~jpountz]






[jira] [Commented] (SOLR-15112) SolrJ DocumentObjectBinder.toSolrInputDocument null handling

2021-01-27 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272960#comment-17272960
 ] 

Alessandro Benedetti commented on SOLR-15112:
-

Culprit:


{code:java}
else {
  if (field.child != null) {
    addChild(obj, field, doc);
  } else {
    doc.setField(field.name, field.get(obj));
  }
}
{code}

In the preceding if branch, a null field is skipped, but here there is no null 
check anymore.
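A minimal sketch of the proposed behaviour (using a plain Map as a stand-in for SolrInputDocument; names are illustrative, not the actual patch): skip null values entirely instead of writing them into the document:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the proposed behaviour, using a plain Map as a stand-in for
// SolrInputDocument: null values are skipped instead of being added, so a
// processor that only checks field presence never sees a null field.
public class NullSkippingBinderDemo {

  static void setFieldIfNotNull(Map<String, Object> doc, String name, Object value) {
    if (value == null) {
      return; // proposal: do NOT add null fields to the document
    }
    doc.put(name, value);
  }

  public static void main(String[] args) {
    Map<String, Object> doc = new LinkedHashMap<>();
    setFieldIfNotNull(doc, "id", "doc-1");
    setFieldIfNotNull(doc, "title", null);
    System.out.println(doc.containsKey("title")); // prints false
  }
}
```

Because the null "title" is never added, a presence-based processor such as UUIDUpdateProcessorFactory would still populate the field as expected.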


> SolrJ DocumentObjectBinder.toSolrInputDocument null handling
> 
>
> Key: SOLR-15112
> URL: https://issues.apache.org/jira/browse/SOLR-15112
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrJ
>Affects Versions: 8.7
>Reporter: Alessandro Benedetti
>Priority: Minor
> Attachments: Screenshot 2021-01-27 at 16.07.13.png
>
>
> Currently the:
> org.apache.solr.client.solrj.beans.DocumentObjectBinder#toSolrInputDocument
> method doesn't handle nulls in java objects very well.
> Even if the field is null in the Java Object, the binder adds the field(with 
> the null value) to the SolrInputDocument.
> This may cause issues down the line, for example using 
> UpdateRequestProcessors such as the UUIDUpdateProcessorFactory (which doesn't 
> check the value of the field, but it just checks if a field is present)
> The proposal here is to make the binder NOT add null fields to the 
> SolrInputDocument.
> Any objection is welcome






[jira] [Created] (SOLR-15112) SolrJ DocumentObjectBinder.toSolrInputDocument null handling

2021-01-27 Thread Alessandro Benedetti (Jira)
Alessandro Benedetti created SOLR-15112:
---

 Summary: SolrJ DocumentObjectBinder.toSolrInputDocument null 
handling
 Key: SOLR-15112
 URL: https://issues.apache.org/jira/browse/SOLR-15112
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SolrJ
Affects Versions: 8.7
Reporter: Alessandro Benedetti
 Attachments: Screenshot 2021-01-27 at 16.07.13.png

Currently, the
org.apache.solr.client.solrj.beans.DocumentObjectBinder#toSolrInputDocument
method doesn't handle nulls in Java objects very well.

Even if a field is null in the Java object, the binder adds the field (with 
the null value) to the SolrInputDocument.

This may cause issues down the line, for example with UpdateRequestProcessors 
such as UUIDUpdateProcessorFactory (which doesn't check the value of the 
field, only whether the field is present).

The proposal here is to make the binder NOT add null fields to the 
SolrInputDocument.

Any objections are welcome.






[jira] [Commented] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-12-09 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246724#comment-17246724
 ] 

Alessandro Benedetti commented on LUCENE-9136:
--

Now that https://issues.apache.org/jira/browse/LUCENE-9322 has been resolved, 
what remains before this issue can be merged?

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: glove-100-angular.png, glove-25-angular.png, 
> image-2020-03-07-01-22-06-132.png, image-2020-03-07-01-25-58-047.png, 
> image-2020-03-07-01-27-12-859.png, sift-128-euclidean.png
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++, and no plan for supporting Java interface, making it hard 
> to be integrated in Java projects or those who are not familier with C/C++  
> [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-base algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Local Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-base algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene, has made great progress. The issue draws attention 
> of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]].
>  Another advantage is that IVFFlat can be faster and more accurate when 
> enables GPU parallel computing (current not support in Java). Both algorithms 
> have their merits and demerits. Since HNSW is now under development, it may 
> be better to provide both implementations (HNSW && IVFFlat) for potential 
> users who are faced with very different scenarios and want to more choices.
> The latest branch is 
> [*lucene-9136-ann-ivfflat*]([https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat)|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat]






[jira] [Commented] (SOLR-14397) Vector Search in Solr

2020-12-08 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17245867#comment-17245867
 ] 

Alessandro Benedetti commented on SOLR-14397:
-

Should we resume this work, now that 
https://issues.apache.org/jira/browse/LUCENE-9004 has been officially merged to 
master?
I have only read it superficially and have not yet explored the code, but the 
aforementioned contribution seems quite relevant; perhaps now is the right time 
to revisit the design?

> Vector Search in Solr
> -
>
> Key: SOLR-14397
> URL: https://issues.apache.org/jira/browse/SOLR-14397
> Project: Solr
>  Issue Type: Improvement
>Reporter: Trey Grainger
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Search engines have traditionally relied upon token-based matching (typically 
> keywords) on an inverted index, plus relevance ranking based upon keyword 
> occurrence statistics. This can be viewed as a "sparse vector” match (where 
> each term is a one-hot encoded dimension in the vector), since only a few 
> keywords out of all possible keywords are considered in each query. With the 
> introduction of deep-learning-based transformers over the last few years, 
> however, the state of the art in relevance has moved to ranking models based 
> upon dense vectors that encode a latent, semantic understanding of both 
> language constructs and the underlying domain upon which the model was 
> trained. These dense vectors are also referred to as “embeddings”. An example 
> of this kind of embedding would be taking the phrase “chief executive officer 
> of the tech company” and converting it to [0.03, 1.7, 9.12, 0, 0.3]
>  . Other similar phrases should encode to vectors with very similar numbers, 
> so we may expect a query like “CEO of a technology org” to generate a vector 
> like [0.1, 1.9, 8.9, 0.1, 0.4]. When performing a cosine similarity 
> calculation between these vectors, we would expect a number closer to 1.0, 
> whereas a very unrelated text blurb would generate a much smaller cosine 
> similarity.
> This is a proposal for how we should implement these vector search 
> capabilities in Solr.
> h1. Search Process Overview:
> In order to implement dense vector search, the following process is typically 
> followed:
> h2. Offline:
> An encoder is built. An encoder can take in text (a query, a sentence, a 
> paragraph, a document, etc.) and return a dense vector representing that 
> document in a rich semantic space. The semantic space is learned from 
> training on textual data (usually, though other sources work, too), typically 
> from the domain of the search engine.
> h2. Document Ingestion:
> When documents are processed, they are passed to the encoder, and the dense 
> vector(s) returned are stored as fields on the document. There could be one 
> or more vectors per-document, as the granularity of the vectors could be 
> per-document, per field, per paragraph, per-sentence, or even per phrase or 
> per term.
> h2. Query Time:
> *Encoding:* The query is translated to a dense vector by passing it to the 
> encoder
>  Quantization: The query is quantized. Quantization is the process of taking 
> a vector with many values and turning it into “terms” in a vector space that 
> approximates the full vector space of the dense vectors.
>  *ANN Matching:* A query on the quantized vector tokens is executed as an ANN 
> (approximate nearest neighbor) search. This allows finding most of the best 
> matching documents (typically up to 95%) with a traditional and efficient 
> lookup against the inverted index.
>  _(optional)_ *ANN Ranking*: ranking may be performed based upon the matched 
> quantized tokens to get a rough, initial ranking of documents based upon the 
> similarity of the query and document vectors. This allows the next step 
> (re-ranking) to be performed on a smaller subset of documents. 
>  *Re-Ranking:* Once the initial matching (and optionally ANN ranking) is 
> performed, a similarity calculation (cosine, dot-product, or any number of 
> other calculations) is typically performed between the full (non-quantized) 
> dense vectors for the query and those in the document. This re-ranking will 
> typically be on the top-N results for performance reasons.
>  *Return Results:* As with any search, the final step is typically to return 
> the results in relevance-ranked order. In this case, that would be sorted by 
> the re-ranking similarity score (i.e. “cosine descending”).
>  --
> *Variant:* For small document sets, it may be preferable to rank all 
> documents and skip steps steps 2, 3, and 4. This is because ANN Matching 
> typically reduces recall (current state of the art is around 95% recall), so 
> it can be beneficial to rank all documents if performance is not a concern. 
> In this case, step 5 
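The cosine similarities quoted in the description above can be reproduced with a short sketch (the vectors are the illustrative ones from the description, not real model output):

```java
// Illustrative cosine-similarity computation over the example embeddings
// quoted in the description: two similar phrases should score close to 1.0.
public class CosineDemo {

  static double cosine(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];     // accumulate dot product
      normA += a[i] * a[i];   // accumulate squared norms
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    double[] ceoPhrase = {0.03, 1.7, 9.12, 0, 0.3}; // "chief executive officer of the tech company"
    double[] ceoQuery  = {0.1, 1.9, 8.9, 0.1, 0.4}; // "CEO of a technology org"
    // Similar phrases -> cosine close to 1.0
    System.out.println(cosine(ceoPhrase, ceoQuery) > 0.99); // prints true
  }
}
```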

[jira] [Updated] (SOLR-15015) Add support for Interleaving Algorithm parameter in Learning To Rank

2020-11-24 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated SOLR-15015:

Fix Version/s: 8.8
   master (9.0)

> Add support for Interleaving Algorithm parameter in Learning To Rank
> 
>
> Key: SOLR-15015
> URL: https://issues.apache.org/jira/browse/SOLR-15015
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - LTR
>Reporter: Alessandro Benedetti
>Priority: Major
> Fix For: master (9.0), 8.8
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Interleaving has been contributed with SOLR-14560 and it now supports just 
> one algorithm ( Team Draft)
> To facilitate contributions of new algorithm the scope of this issue is to 
> support a new parameter : 'interleavingAlgorithm' (tentative)
> Default value will be team draft interleaving.






[jira] [Resolved] (SOLR-15015) Add support for Interleaving Algorithm parameter in Learning To Rank

2020-11-24 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti resolved SOLR-15015.
-
Resolution: Fixed

Merged into the master and 8.x branches.

> Add support for Interleaving Algorithm parameter in Learning To Rank
> 
>
> Key: SOLR-15015
> URL: https://issues.apache.org/jira/browse/SOLR-15015
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - LTR
>Reporter: Alessandro Benedetti
>Priority: Major
> Fix For: master (9.0), 8.8
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Interleaving has been contributed with SOLR-14560 and it now supports just 
> one algorithm ( Team Draft)
> To facilitate contributions of new algorithm the scope of this issue is to 
> support a new parameter : 'interleavingAlgorithm' (tentative)
> Default value will be team draft interleaving.






[jira] [Created] (SOLR-15015) Add support for Interleaving Algorithm parameter in Learning To Rank

2020-11-23 Thread Alessandro Benedetti (Jira)
Alessandro Benedetti created SOLR-15015:
---

 Summary: Add support for Interleaving Algorithm parameter in 
Learning To Rank
 Key: SOLR-15015
 URL: https://issues.apache.org/jira/browse/SOLR-15015
 Project: Solr
  Issue Type: Task
  Security Level: Public (Default Security Level. Issues are Public)
  Components: contrib - LTR
Reporter: Alessandro Benedetti


Interleaving was contributed with SOLR-14560 and currently supports just one 
algorithm (Team Draft).

To facilitate contributions of new algorithms, the scope of this issue is to 
support a new parameter: 'interleavingAlgorithm' (tentative).

The default value will be Team Draft interleaving.








[jira] [Updated] (SOLR-14560) Learning To Rank Interleaving

2020-11-18 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated SOLR-14560:

Fix Version/s: 8.8
   master (9.0)

> Learning To Rank Interleaving
> -
>
> Key: SOLR-14560
> URL: https://issues.apache.org/jira/browse/SOLR-14560
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - LTR
>Affects Versions: 8.5.2
>Reporter: Alessandro Benedetti
>Priority: Minor
> Fix For: master (9.0), 8.8
>
>  Time Spent: 10h 10m
>  Remaining Estimate: 0h
>
> Interleaving is an approach to online search quality evaluation that can be
> very useful for Learning To Rank models:
> [https://sease.io/2020/05/online-testing-for-learning-to-rank-interleaving.html]
> The scope of this issue is to introduce the ability for the LTR query parser
> to accept multiple models (two to start with).
> If one model is passed, normal reranking happens.
> If two models are passed, reranking happens for both models and the final
> reranked list is the interleaved sequence of results coming from the two
> models' lists.
> As a first step it is going to be implemented through Team Draft interleaving
> with two models in input.
> In the future, we can expand the functionality by adding the interleaving
> algorithm as a parameter.






[jira] [Resolved] (SOLR-14560) Learning To Rank Interleaving

2020-11-18 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti resolved SOLR-14560.
-
Resolution: Fixed

Thanks [~cpoerschke] for all the reviewing help!

This new feature is now merged in master and 8.x (upcoming 8.8.0).

> Learning To Rank Interleaving
> -
>
> Key: SOLR-14560
> URL: https://issues.apache.org/jira/browse/SOLR-14560
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - LTR
>Affects Versions: 8.5.2
>Reporter: Alessandro Benedetti
>Priority: Minor
>  Time Spent: 10h 10m
>  Remaining Estimate: 0h
>
> Interleaving is an approach to online search quality evaluation that can be
> very useful for Learning To Rank models:
> [https://sease.io/2020/05/online-testing-for-learning-to-rank-interleaving.html]
> The scope of this issue is to introduce the ability for the LTR query parser
> to accept multiple models (two to start with).
> If one model is passed, normal reranking happens.
> If two models are passed, reranking happens for both models and the final
> reranked list is the interleaved sequence of results coming from the two
> models' lists.
> As a first step it is going to be implemented through Team Draft interleaving
> with two models in input.
> In the future, we can expand the functionality by adding the interleaving
> algorithm as a parameter.






[jira] [Comment Edited] (SOLR-14560) Learning To Rank Interleaving

2020-10-26 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163109#comment-17163109
 ] 

Alessandro Benedetti edited comment on SOLR-14560 at 10/26/20, 4:01 PM:


The review is now open.
The pull request is in a much better state and it is ready to be reviewed.
I will add a few more tests and respond to review feedback.
Then, once we have an acceptable patch, I will proceed with the commit.


*Known Limitations*: no support in sharded mode yet


was (Author: alessandro.benedetti):
The review is not open.
The pull request is in a much better state and it is ready to be reviewed.
I will add a few more tests and respond to review feedbacks.
Then, once we have an acceptable patch I will proceed with the commit.


*Known Limitations*: no support in sharded mode yet

> Learning To Rank Interleaving
> -
>
> Key: SOLR-14560
> URL: https://issues.apache.org/jira/browse/SOLR-14560
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - LTR
>Affects Versions: 8.5.2
>Reporter: Alessandro Benedetti
>Priority: Minor
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Interleaving is an approach to online search quality evaluation that can be
> very useful for Learning To Rank models:
> [https://sease.io/2020/05/online-testing-for-learning-to-rank-interleaving.html]
> The scope of this issue is to introduce the ability for the LTR query parser
> to accept multiple models (two to start with).
> If one model is passed, normal reranking happens.
> If two models are passed, reranking happens for both models and the final
> reranked list is the interleaved sequence of results coming from the two
> models' lists.
> As a first step it is going to be implemented through Team Draft interleaving
> with two models in input.
> In the future, we can expand the functionality by adding the interleaving
> algorithm as a parameter.






[jira] [Commented] (SOLR-14408) Refactor MoreLikeThisHandler Implementation

2020-10-07 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209470#comment-17209470
 ] 

Alessandro Benedetti commented on SOLR-14408:
-

I agree with [~dsmiley]. [~nazerke], can you update the pull request to revert
the class externalisation?
I believe an inner class would suffice in this case.

> Refactor MoreLikeThisHandler Implementation
> ---
>
> Key: SOLR-14408
> URL: https://issues.apache.org/jira/browse/SOLR-14408
> Project: Solr
>  Issue Type: Improvement
>  Components: MoreLikeThis
>Reporter: Nazerke Seidan
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The main goal of this refactoring is the readability and accessibility of the
> MoreLikeThisHandler class. The current MoreLikeThisHandler class consists of
> two static subclasses, which are accessed later in MoreLikeThisComponent. I
> propose to have them as separate public classes.
> cc: [~abenedetti], as you have had the recent commit for MLT, what do you
> think about this? Anyway, the code is ready for review.
>  
>  
>  






[jira] [Commented] (SOLR-14408) Refactor MoreLikeThisHandler Implementation

2020-09-28 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203486#comment-17203486
 ] 

Alessandro Benedetti commented on SOLR-14408:
-

Hi [~Seidan], sorry for the abysmal delay in responding; I just got the chance
to review your pull request.
It seems OK to me; the only question relates to the interesting term class.
Are we sure there isn't anything similar already in the Apache Lucene/Solr
codebase?
I took a quick look and I wasn't able to find it.

Let me know and we can progress with the merge.

Cheers


> Refactor MoreLikeThisHandler Implementation
> ---
>
> Key: SOLR-14408
> URL: https://issues.apache.org/jira/browse/SOLR-14408
> Project: Solr
>  Issue Type: Improvement
>  Components: MoreLikeThis
>Reporter: Nazerke Seidan
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The main goal of this refactoring is the readability and accessibility of the
> MoreLikeThisHandler class. The current MoreLikeThisHandler class consists of
> two static subclasses, which are accessed later in MoreLikeThisComponent. I
> propose to have them as separate public classes.
> cc: [~abenedetti], as you have had the recent commit for MLT, what do you
> think about this? Anyway, the code is ready for review.
>  
>  
>  






[jira] [Comment Edited] (SOLR-14560) Learning To Rank Interleaving

2020-07-22 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163109#comment-17163109
 ] 

Alessandro Benedetti edited comment on SOLR-14560 at 7/22/20, 10:46 PM:


The review is now open.
The pull request is in a much better state and it is ready to be reviewed.
I will add a few more tests and respond to review feedbacks.
Then, once we have an acceptable patch I will proceed with the commit.


*Known Limitations*: no support in sharded mode yet


was (Author: alessandro.benedetti):
The review is not open.
The pull request is in a much better state and it is ready to be reviewed.
I will add a few more tests and respond to review feedbacks.
Then, once we have an acceptable patch I will proceed with the commit.

> Learning To Rank Interleaving
> -
>
> Key: SOLR-14560
> URL: https://issues.apache.org/jira/browse/SOLR-14560
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - LTR
>Affects Versions: 8.5.2
>Reporter: Alessandro Benedetti
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Interleaving is an approach to online search quality evaluation that can be
> very useful for Learning To Rank models:
> [https://sease.io/2020/05/online-testing-for-learning-to-rank-interleaving.html]
> The scope of this issue is to introduce the ability for the LTR query parser
> to accept multiple models (two to start with).
> If one model is passed, normal reranking happens.
> If two models are passed, reranking happens for both models and the final
> reranked list is the interleaved sequence of results coming from the two
> models' lists.
> As a first step it is going to be implemented through Team Draft interleaving
> with two models in input.
> In the future, we can expand the functionality by adding the interleaving
> algorithm as a parameter.






[jira] [Commented] (SOLR-14560) Learning To Rank Interleaving

2020-07-22 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163109#comment-17163109
 ] 

Alessandro Benedetti commented on SOLR-14560:
-

The review is now open.
The pull request is in a much better state and it is ready to be reviewed.
I will add a few more tests and respond to review feedbacks.
Then, once we have an acceptable patch I will proceed with the commit.

> Learning To Rank Interleaving
> -
>
> Key: SOLR-14560
> URL: https://issues.apache.org/jira/browse/SOLR-14560
> Project: Solr
>  Issue Type: New Feature
>  Components: contrib - LTR
>Affects Versions: 8.5.2
>Reporter: Alessandro Benedetti
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Interleaving is an approach to online search quality evaluation that can be
> very useful for Learning To Rank models:
> [https://sease.io/2020/05/online-testing-for-learning-to-rank-interleaving.html]
> The scope of this issue is to introduce the ability for the LTR query parser
> to accept multiple models (two to start with).
> If one model is passed, normal reranking happens.
> If two models are passed, reranking happens for both models and the final
> reranked list is the interleaved sequence of results coming from the two
> models' lists.
> As a first step it is going to be implemented through Team Draft interleaving
> with two models in input.
> In the future, we can expand the functionality by adding the interleaving
> algorithm as a parameter.






[jira] [Commented] (SOLR-14560) Learning To Rank Interleaving

2020-06-11 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133592#comment-17133592
 ] 

Alessandro Benedetti commented on SOLR-14560:
-

The draft is attached:

[https://github.com/apache/lucene-solr/pull/1571]

Any comments on the architectural changes and the places I have touched so far
are more than welcome.

Bear in mind the task is still a work in progress and changes/tests will
happen, so take this into account if you are curious and willing to leave a
comment.

Once ready for code review I will add a comment here and finalise the pull
request from draft.
I will proceed with the merge once at least one other committer approves.

I am tagging all the people who have worked on Learning To Rank, in no
particular order:

[~cpoerschke] [~diegoceccarelli] [~mnilsson]
[~jpantony] [~jdorando] [~nsanthapuri] [~dave1g]


> Learning To Rank Interleaving
> -
>
> Key: SOLR-14560
> URL: https://issues.apache.org/jira/browse/SOLR-14560
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - LTR
>Affects Versions: 8.5.2
>Reporter: Alessandro Benedetti
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Interleaving is an approach to online search quality evaluation that can be
> very useful for Learning To Rank models:
> [https://sease.io/2020/05/online-testing-for-learning-to-rank-interleaving.html]
> The scope of this issue is to introduce the ability for the LTR query parser
> to accept multiple models (two to start with).
> If one model is passed, normal reranking happens.
> If two models are passed, reranking happens for both models and the final
> reranked list is the interleaved sequence of results coming from the two
> models' lists.
> As a first step it is going to be implemented through Team Draft interleaving
> with two models in input.
> In the future, we can expand the functionality by adding the interleaving
> algorithm as a parameter.






[jira] [Created] (SOLR-14560) Learning To Rank Interleaving

2020-06-11 Thread Alessandro Benedetti (Jira)
Alessandro Benedetti created SOLR-14560:
---

 Summary: Learning To Rank Interleaving
 Key: SOLR-14560
 URL: https://issues.apache.org/jira/browse/SOLR-14560
 Project: Solr
  Issue Type: New Feature
  Security Level: Public (Default Security Level. Issues are Public)
  Components: contrib - LTR
Affects Versions: 8.5.2
Reporter: Alessandro Benedetti


Interleaving is an approach to online search quality evaluation that can be
very useful for Learning To Rank models:
[https://sease.io/2020/05/online-testing-for-learning-to-rank-interleaving.html]
The scope of this issue is to introduce the ability for the LTR query parser to
accept multiple models (two to start with).

If one model is passed, normal reranking happens.
If two models are passed, reranking happens for both models and the final
reranked list is the interleaved sequence of results coming from the two
models' lists.

As a first step it is going to be implemented through Team Draft interleaving
with two models in input.

In the future, we can expand the functionality by adding the interleaving
algorithm as a parameter.
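As a rough illustration of the Team Draft scheme described above, here is a stand-alone sketch under my own simplifying assumptions (string document IDs, not the Solr contribution): at each round the team that has picked fewer documents, with ties broken randomly, selects its highest-ranked document not yet in the interleaved list.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;

public class TeamDraftInterleaving {

    /** Merges the two models' reranked lists into one interleaved list of distinct docs. */
    public static List<String> interleave(List<String> rankedA, List<String> rankedB,
                                          int size, Random random) {
        LinkedHashSet<String> result = new LinkedHashSet<>();
        int pickedFromA = 0, pickedFromB = 0; // how many docs each "team" has contributed
        int idxA = 0, idxB = 0;               // next candidate in each ranked list
        while (result.size() < size && (idxA < rankedA.size() || idxB < rankedB.size())) {
            boolean aTurn;
            if (idxA >= rankedA.size()) {
                aTurn = false;                // only B has candidates left
            } else if (idxB >= rankedB.size()) {
                aTurn = true;                 // only A has candidates left
            } else {
                // The team with fewer picks goes first; ties are broken randomly.
                aTurn = pickedFromA < pickedFromB
                        || (pickedFromA == pickedFromB && random.nextBoolean());
            }
            if (aTurn) {
                // add() returns false for a doc already picked; skipping it keeps the turn fair.
                if (result.add(rankedA.get(idxA++))) pickedFromA++;
            } else {
                if (result.add(rankedB.get(idxB++))) pickedFromB++;
            }
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        List<String> modelA = List.of("d1", "d2", "d3", "d4");
        List<String> modelB = List.of("d3", "d1", "d5", "d6");
        // Four distinct docs drawn from both lists; order depends on the random tie-breaks.
        System.out.println(interleave(modelA, modelB, 4, new Random(42)));
    }
}
```

Recording which team contributed each document (omitted here) is what later lets an online evaluation credit clicks to one model or the other.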








[jira] [Commented] (SOLR-14408) Refactor MoreLikeThisHandler Implementation

2020-04-15 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083990#comment-17083990
 ] 

Alessandro Benedetti commented on SOLR-14408:
-

Can you attach a pull request to review? Happy to take a look.
I will be actively working on the More Like This refactor to make it more
usable.

> Refactor MoreLikeThisHandler Implementation
> ---
>
> Key: SOLR-14408
> URL: https://issues.apache.org/jira/browse/SOLR-14408
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: MoreLikeThis
>Reporter: Nazerke Seidan
>Priority: Minor
>
> The main goal of this refactoring is the readability and accessibility of the
> MoreLikeThisHandler class. The current MoreLikeThisHandler class consists of
> two static subclasses, which are accessed later in MoreLikeThisComponent. I
> propose to have them as separate public classes.
> cc: [~abenedetti], as you have had the recent commit for MLT, what do you
> think about this? Anyway, the code is ready for review.
>  
>  
>  






[jira] [Commented] (SOLR-12238) Synonym Query Style Boost By Payload

2020-02-07 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032489#comment-17032489
 ] 

Alessandro Benedetti commented on SOLR-12238:
-

Hi [~dsmiley], [~romseygeek], first of all, thank you again for your patience
and very useful insights.
The child Lucene issue and pull request have been updated, incorporating Alan's
suggestions.


> Synonym Query Style Boost By Payload
> 
>
> Key: SOLR-12238
> URL: https://issues.apache.org/jira/browse/SOLR-12238
> Project: Solr
>  Issue Type: Improvement
>  Components: query parsers
>Affects Versions: 7.2
>Reporter: Alessandro Benedetti
>Priority: Major
> Attachments: SOLR-12238.patch, SOLR-12238.patch, SOLR-12238.patch, 
> SOLR-12238.patch
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> This improvement is built on top of the Synonym Query Style feature and
> brings the possibility of boosting synonym queries using the associated
> payload.
> It introduces two new modalities for the Synonym Query Style:
> PICK_BEST_BOOST_BY_PAYLOAD -> build a disjunction query with the clauses
> boosted by payload
> AS_DISTINCT_TERMS_BOOST_BY_PAYLOAD -> build a Boolean query with the clauses
> boosted by payload
> These new synonym query styles assume payloads are available, so they must
> be used in conjunction with a token filter able to produce payloads.
> A synonym.txt example could be:
> # Synonyms used by Payload Boost
> tiger => tiger|1.0, Big_Cat|0.8, Shere_Khan|0.9
> leopard => leopard, Big_Cat|0.8, Bagheera|0.9
> lion => lion|1.0, panthera leo|0.99, Simba|0.8
> snow_leopard => panthera uncia|0.99, snow leopard|1.0
> A simple token filter to populate the payloads from such a synonym.txt is:
> <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|"/>
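The payload-annotated synonym format shown in the synonym.txt sample above can be parsed with a small stand-alone sketch. This is a hypothetical helper for illustration, not the Solr token filter: each right-hand term carries an optional `|weight` suffix, defaulting to 1.0 when absent.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PayloadSynonymLine {

    /**
     * Parses the right-hand side of a line like "tiger => tiger|1.0, Big_Cat|0.8"
     * into a term -> boost map, preserving term order.
     */
    public static Map<String, Float> parseRhs(String line) {
        String rhs = line.substring(line.indexOf("=>") + 2);
        Map<String, Float> boosts = new LinkedHashMap<>();
        for (String token : rhs.split(",")) {
            String t = token.trim();
            int bar = t.lastIndexOf('|');
            if (bar >= 0) {
                boosts.put(t.substring(0, bar), Float.parseFloat(t.substring(bar + 1)));
            } else {
                boosts.put(t, 1.0f); // no payload: neutral boost
            }
        }
        return boosts;
    }

    public static void main(String[] args) {
        System.out.println(parseRhs("tiger => tiger|1.0, Big_Cat|0.8, Shere_Khan|0.9"));
        // → {tiger=1.0, Big_Cat=0.8, Shere_Khan=0.9}
    }
}
```

Note that multi-word synonyms such as "panthera leo|0.99" parse cleanly because the split is on commas, not whitespace.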






[jira] [Commented] (LUCENE-9171) Synonyms Boost by Payload

2020-02-07 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032462#comment-17032462
 ] 

Alessandro Benedetti commented on LUCENE-9171:
--

Hi [~romseygeek], first of all, thank you again for your patience and very
useful insights.
I have incorporated your changes and cleaned everything up.
You will find the original PR updated.

My unresolved questions:

- boostAttribute doesn't use BytesRef but a float directly; is that a concern?
We are expected to use it at query time, so we could actually see a minimal
query-time benefit in not encoding/decoding.

- You expressed concerns over SpanBoostQuery, mentioning it is sort of broken;
what should we do in that regard? Right now the created span query seems to
work as expected with boosted synonyms (see the related test). If
SpanBoostQuery is broken, I suspect it should be resolved in another ticket.

- From an original comment in the test code
org.apache.solr.search.TestSolrQueryParser#testSynonymQueryStyle:
"confirm autoGeneratePhraseQueries always builds OR queries".
I changed that; was there any reason for it?

> Synonyms Boost by Payload
> -
>
> Key: LUCENE-9171
> URL: https://issues.apache.org/jira/browse/LUCENE-9171
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/queryparser
>Reporter: Alessandro Benedetti
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I have been working on the additional capability of boosting queries by term
> payloads, through a parameter to enable it in the Lucene query builder.
> This has been done targeting the SynonymQuery.
> It is parametric, so it is meant to introduce no difference unless the
> feature is enabled.
> Solr has its bits to comply through its SynonymsQueryStyles.






[jira] [Commented] (LUCENE-9171) Synonyms Boost by Payload

2020-02-03 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028858#comment-17028858
 ] 

Alessandro Benedetti commented on LUCENE-9171:
--

Hi [~romseygeek], thanks for your feedback again; that is very kind of you.

I am not fully convinced about moving back to terms and a separate array of
boosts that will be effectively ignored in Lucene and only used by
implementations (such as Solr).

Furthermore, I moved the conditional parameter to extract the boosts from the
payload info in Solr; if we move this back to Lucene, will it be necessary to
move the conditional extraction back to Lucene as well?

What are the cons of [~dsmiley]'s suggestion of generalising this to use the
TokenStream instead?
That way the Lucene signature will look all right, as it uses the TokenStream,
and Solr will use the TokenStream to extract payload info if the conditional
parameter is there.

What do you think?

> Synonyms Boost by Payload
> -
>
> Key: LUCENE-9171
> URL: https://issues.apache.org/jira/browse/LUCENE-9171
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/queryparser
>Reporter: Alessandro Benedetti
>Priority: Major
>
> I have been working on the additional capability of boosting queries by term
> payloads, through a parameter to enable it in the Lucene query builder.
> This has been done targeting the SynonymQuery.
> It is parametric, so it is meant to introduce no difference unless the
> feature is enabled.
> Solr has its bits to comply through its SynonymsQueryStyles.






[jira] [Commented] (LUCENE-9171) Synonyms Boost by Payload

2020-01-31 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17027366#comment-17027366
 ] 

Alessandro Benedetti commented on LUCENE-9171:
--

I would love to leverage the momentum and fresh memory on this project to
finalise the contribution, but I didn't get any feedback on the TokenStream /
AttributeSource controversy in the last few days.
If I don't get any additional feedback on that by mid next week, I'll take
[~dsmiley]'s consideration and change the implementation toward using the
TokenStream directly.
I would be super happy to receive additional feedback and considerations from
[~romseygeek]; maybe the AttributeSource approach was misunderstood?

> Synonyms Boost by Payload
> -
>
> Key: LUCENE-9171
> URL: https://issues.apache.org/jira/browse/LUCENE-9171
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/queryparser
>Reporter: Alessandro Benedetti
>Priority: Major
>
> I have been working on the additional capability of boosting queries by term
> payloads, through a parameter to enable it in the Lucene query builder.
> This has been done targeting the SynonymQuery.
> It is parametric, so it is meant to introduce no difference unless the
> feature is enabled.
> Solr has its bits to comply through its SynonymsQueryStyles.






[jira] [Commented] (SOLR-12238) Synonym Query Style Boost By Payload

2020-01-28 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025050#comment-17025050
 ] 

Alessandro Benedetti commented on SOLR-12238:
-

I followed the refactoring comments from both @diegoceccarelli and @romseygeek.
The PR seems much cleaner right now, on both the Lucene and Solr sides.
Copious tests are present and should cover the various situations.

A few questions remain:

- In a test I read a comment from @dsmiley saying "confirm
autoGeneratePhraseQueries always builds OR queries", from
org.apache.solr.search.TestSolrQueryParser#testSynonymQueryStyle.

- What can we do about SpanBoostQuery? I was completely unaware it is going to
be deprecated.

Let me know

> Synonym Query Style Boost By Payload
> 
>
> Key: SOLR-12238
> URL: https://issues.apache.org/jira/browse/SOLR-12238
> Project: Solr
>  Issue Type: Improvement
>  Components: query parsers
>Affects Versions: 7.2
>Reporter: Alessandro Benedetti
>Priority: Major
> Attachments: SOLR-12238.patch, SOLR-12238.patch, SOLR-12238.patch, 
> SOLR-12238.patch
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> This improvement is built on top of the Synonym Query Style feature and
> brings the possibility of boosting synonym queries using the associated
> payload.
> It introduces two new modalities for the Synonym Query Style:
> PICK_BEST_BOOST_BY_PAYLOAD -> build a disjunction query with the clauses
> boosted by payload
> AS_DISTINCT_TERMS_BOOST_BY_PAYLOAD -> build a Boolean query with the clauses
> boosted by payload
> These new synonym query styles assume payloads are available, so they must
> be used in conjunction with a token filter able to produce payloads.
> A synonym.txt example could be:
> # Synonyms used by Payload Boost
> tiger => tiger|1.0, Big_Cat|0.8, Shere_Khan|0.9
> leopard => leopard, Big_Cat|0.8, Bagheera|0.9
> lion => lion|1.0, panthera leo|0.99, Simba|0.8
> snow_leopard => panthera uncia|0.99, snow leopard|1.0
> A simple token filter to populate the payloads from such a synonym.txt is:
> <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|"/>
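The practical difference between the two modalities described in this issue can be illustrated numerically. This is a simplified sketch with made-up scores, not Lucene's actual scoring: PICK_BEST_BOOST_BY_PAYLOAD keeps the maximum payload-boosted clause score (disjunction-max behaviour), while AS_DISTINCT_TERMS_BOOST_BY_PAYLOAD sums the boosted clauses (Boolean SHOULD behaviour).

```java
import java.util.Map;

public class SynonymBoostModes {

    /** PICK_BEST style: the document scores as its best payload-boosted synonym clause. */
    public static double pickBest(Map<String, Double> clauseScores, Map<String, Double> boosts) {
        return clauseScores.entrySet().stream()
                .mapToDouble(e -> e.getValue() * boosts.getOrDefault(e.getKey(), 1.0))
                .max().orElse(0.0);
    }

    /** AS_DISTINCT_TERMS style: every matching boosted synonym clause contributes. */
    public static double asDistinctTerms(Map<String, Double> clauseScores, Map<String, Double> boosts) {
        return clauseScores.entrySet().stream()
                .mapToDouble(e -> e.getValue() * boosts.getOrDefault(e.getKey(), 1.0))
                .sum();
    }

    public static void main(String[] args) {
        // Both synonyms match with raw score 2.0; payload boosts come from "tiger|1.0, Big_Cat|0.8".
        Map<String, Double> scores = Map.of("tiger", 2.0, "Big_Cat", 2.0);
        Map<String, Double> boosts = Map.of("tiger", 1.0, "Big_Cat", 0.8);
        System.out.println(pickBest(scores, boosts));        // 2.0 (best single clause)
        System.out.println(asDistinctTerms(scores, boosts)); // 3.6 (clauses accumulate)
    }
}
```

The sketch shows why the choice matters: with the summing style, documents matching several synonyms accumulate score, while the pick-best style treats synonyms as true alternatives.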






[jira] [Commented] (LUCENE-9171) Synonyms Boost by Payload

2020-01-27 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024589#comment-17024589
 ] 

Alessandro Benedetti commented on LUCENE-9171:
--

Thanks [~romseygeek], your feedback has been extremely valuable.
I proceeded with the implementation.
The code is attached to the PR and it seems much cleaner to me now that I have
followed the AttributeSource approach.

Let me know,


> Synonyms Boost by Payload
> -
>
> Key: LUCENE-9171
> URL: https://issues.apache.org/jira/browse/LUCENE-9171
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/queryparser
>Reporter: Alessandro Benedetti
>Priority: Major
>
> I have been working on the additional capability of boosting queries by term
> payloads, through a parameter to enable it in the Lucene query builder.
> This has been done targeting the SynonymQuery.
> It is parametric, so it is meant to introduce no difference unless the
> feature is enabled.
> Solr has its bits to comply through its SynonymsQueryStyles.






[jira] [Commented] (LUCENE-9171) Synonyms Boost by Payload

2020-01-27 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024216#comment-17024216
 ] 

Alessandro Benedetti commented on LUCENE-9171:
--

Hi Alan,
I'll work on this during the week.

I am investigating the usage of AttributeSources.
Now, a first question: as you can see in the pull request, I added a boolean
class variable "synonymsBoostByPayload" at the Lucene level:

{code:java}
 protected boolean enablePositionIncrements = true;
  protected boolean enableGraphQueries = true;
  protected boolean autoGenerateMultiTermSynonymsPhraseQuery = false;
  protected boolean synonymsBoostByPayload = false;
{code}

When you say:

" Instead, it would be useful to try and see what we need to change in 
QueryBuilder to allow all of the payload handling to happen in subclasses."

Do you mean it would be better to see that disappear and leave it only in the
SolrQueryParserBase implementation?
Don't you think it could be useful to boost synonyms by payload at the Lucene
level as well, as a new feature?
No strong opinion on that, just wondering :)

Regarding the span query observation, I'll think about it.



> Synonyms Boost by Payload
> -
>
> Key: LUCENE-9171
> URL: https://issues.apache.org/jira/browse/LUCENE-9171
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/queryparser
>Reporter: Alessandro Benedetti
>Priority: Major
>
> I have been working on the additional capability of boosting queries by term
> payloads, through a parameter to enable it in the Lucene query builder.
> This has been done targeting the SynonymQuery.
> It is parametric, so it is meant to introduce no difference unless the
> feature is enabled.
> Solr has its bits to comply through its SynonymsQueryStyles.






[jira] [Comment Edited] (SOLR-12238) Synonym Query Style Boost By Payload

2020-01-25 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023539#comment-17023539
 ] 

Alessandro Benedetti edited comment on SOLR-12238 at 1/25/20 1:05 PM:
--

The PR has been hugely updated.
I linked the related Lucene Jira, resolved all conflicts, and re-engineered the
approach, hopefully in a cleaner way.
I tagged in the Lucene issue all the relevant people that have worked on this
in the last few years.

Solr tests have been hugely refined, and the next steps will be oriented toward
the Lucene query builder tests.


was (Author: alessandro.benedetti):
I linked the related Lucene Jira, resolved all conflicts, and re-engineered the 
approach, hopefully in a cleaner way.
I tagged in the Lucene issue all the relevant people who worked on this in the 
last few years.

Solr tests have been hugely refined, and next steps will be oriented towards 
the Lucene QueryBuilder tests.

> Synonym Query Style Boost By Payload
> 
>
> Key: SOLR-12238
> URL: https://issues.apache.org/jira/browse/SOLR-12238
> Project: Solr
>  Issue Type: Improvement
>  Components: query parsers
>Affects Versions: 7.2
>Reporter: Alessandro Benedetti
>Priority: Major
> Attachments: SOLR-12238.patch, SOLR-12238.patch, SOLR-12238.patch, 
> SOLR-12238.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This improvement is built on top of the Synonym Query Style feature and 
> brings the possibility of boosting synonym queries using the payload 
> associated.
> It introduces two new modalities for the Synonym Query Style:
> PICK_BEST_BOOST_BY_PAYLOAD -> builds a disjunction query with the clauses 
> boosted by payload
> AS_DISTINCT_TERMS_BOOST_BY_PAYLOAD -> builds a boolean query with the clauses 
> boosted by payload
> These new synonym query styles assume payloads are available, so they must 
> be used in conjunction with a token filter able to produce payloads.
> A synonym.txt example could be:
> # Synonyms used by Payload Boost
> tiger => tiger|1.0, Big_Cat|0.8, Shere_Khan|0.9
> leopard => leopard, Big_Cat|0.8, Bagheera|0.9
> lion => lion|1.0, panthera leo|0.99, Simba|0.8
> snow_leopard => panthera uncia|0.99, snow leopard|1.0
> A simple token filter to populate the payloads from such a synonym.txt is:
> <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|"/>
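As a rough sketch of the two modalities above (a standalone illustration with hypothetical names; the real implementation builds Lucene query clauses, which this deliberately avoids): PICK_BEST keeps the single highest payload-boosted clause score, as a disjunction-max would, while AS_DISTINCT_TERMS accumulates all boosted clauses, as boolean SHOULD clauses would.

```java
import java.util.Map;

// Hypothetical sketch of the two scoring modalities, independent of the
// Lucene query classes: given per-synonym raw scores and payload boosts,
// PICK_BEST keeps the single highest boosted score (disjunction-max),
// AS_DISTINCT_TERMS sums all boosted scores (boolean SHOULD clauses).
public class SynonymBoostSketch {
    public static double pickBestBoostByPayload(Map<String, Double> scores,
                                                Map<String, Double> boosts) {
        return scores.entrySet().stream()
            .mapToDouble(e -> e.getValue() * boosts.getOrDefault(e.getKey(), 1.0))
            .max().orElse(0.0);
    }

    public static double asDistinctTermsBoostByPayload(Map<String, Double> scores,
                                                       Map<String, Double> boosts) {
        return scores.entrySet().stream()
            .mapToDouble(e -> e.getValue() * boosts.getOrDefault(e.getKey(), 1.0))
            .sum();
    }

    public static void main(String[] args) {
        // Boosts taken from the "tiger" line of the synonym.txt example above.
        Map<String, Double> scores = Map.of("tiger", 2.0, "Big_Cat", 2.0, "Shere_Khan", 2.0);
        Map<String, Double> boosts = Map.of("tiger", 1.0, "Big_Cat", 0.8, "Shere_Khan", 0.9);
        System.out.println(pickBestBoostByPayload(scores, boosts));
        System.out.println(asDistinctTermsBoostByPayload(scores, boosts));
    }
}
```

With equal raw scores, PICK_BEST is driven entirely by the highest payload, while AS_DISTINCT_TERMS rewards documents matching several synonyms.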






[jira] [Comment Edited] (SOLR-12238) Synonym Query Style Boost By Payload

2020-01-25 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023539#comment-17023539
 ] 

Alessandro Benedetti edited comment on SOLR-12238 at 1/25/20 1:05 PM:
--

I linked the related Lucene Jira, resolved all conflicts, and re-engineered the 
approach, hopefully in a cleaner way.
I tagged in the Lucene issue all the relevant people who worked on this in the 
last few years.

Solr tests have been hugely refined, and next steps will be oriented towards 
the Lucene QueryBuilder tests.


was (Author: alessandro.benedetti):
The Lucene functionality is described here







[jira] [Commented] (LUCENE-9171) Synonyms Boost by Payload

2020-01-25 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023543#comment-17023543
 ] 

Alessandro Benedetti commented on LUCENE-9171:
--

[~romseygeek], [~jim.ferenczi], [~jimczi], [~dsmiley], [~mikemccand], 
[~shalin]: I noticed the last activity on the class was from you over the past 
few years; any review and consideration would be welcome :)







[jira] [Commented] (LUCENE-9171) Synonyms Boost by Payload

2020-01-25 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023541#comment-17023541
 ] 

Alessandro Benedetti commented on LUCENE-9171:
--

You can take a look at the GitHub pull request (which also includes the Solr 
changes).
I will also add a set of tests on the Lucene side.
Current questions:

- Is keeping all the payload extraction from token streams on the Lucene side 
a clean approach?

- Discuss payload decoding and the possible use of 
org.apache.lucene.analysis.payloads.PayloadHelper (no visibility from core).

- Need to evaluate better the part in 
org/apache/lucene/util/QueryBuilder.java:629 
(org.apache.lucene.util.QueryBuilder#analyzeGraphBoolean).
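For reference, the float payload encoding in question is a big-endian bit layout; the following is a standalone sketch of that round trip, reimplemented here only for illustration (the real org.apache.lucene.analysis.payloads.PayloadHelper is the authoritative version, and, as noted above, is not visible from core):

```java
// Standalone sketch of big-endian float payload encoding/decoding, as a
// stand-in for PayloadHelper.encodeFloat/decodeFloat; illustrative only.
public class PayloadDecodeSketch {
    public static byte[] encodeFloat(float payload) {
        int bits = Float.floatToIntBits(payload);
        return new byte[] {
            (byte) (bits >>> 24), (byte) (bits >>> 16),
            (byte) (bits >>> 8), (byte) bits
        };
    }

    public static float decodeFloat(byte[] bytes) {
        int bits = ((bytes[0] & 0xFF) << 24) | ((bytes[1] & 0xFF) << 16)
                 | ((bytes[2] & 0xFF) << 8) | (bytes[3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // Round-trip the boost attached to "Big_Cat|0.8" in a synonym file.
        System.out.println(decodeFloat(encodeFloat(0.8f)));
    }
}
```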










[jira] [Commented] (SOLR-12238) Synonym Query Style Boost By Payload

2020-01-25 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023539#comment-17023539
 ] 

Alessandro Benedetti commented on SOLR-12238:
-

The Lucene functionality is described here







[jira] [Updated] (LUCENE-9171) Synonyms Boost by Payload

2020-01-25 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-9171:
-
Status: Patch Available  (was: Open)







[jira] [Created] (LUCENE-9171) Synonyms Boost by Payload

2020-01-25 Thread Alessandro Benedetti (Jira)
Alessandro Benedetti created LUCENE-9171:


 Summary: Synonyms Boost by Payload
 Key: LUCENE-9171
 URL: https://issues.apache.org/jira/browse/LUCENE-9171
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/queryparser
Reporter: Alessandro Benedetti


I have been working on the additional capability of boosting queries by term 
payloads, through a parameter that enables it in the Lucene QueryBuilder.
This has been done targeting the Synonyms Query.
It is parametric, so it is meant to make no difference unless the feature is 
enabled.
Solr has its own bits to comply through its SynonymsQueryStyles.






[jira] [Commented] (LUCENE-8329) Size Estimator wrongly calculate Disk space in MB

2020-01-23 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1706#comment-1706
 ] 

Alessandro Benedetti commented on LUCENE-8329:
--

Any interest in fixing this?
If there is no intention of maintaining it, should we just remove it entirely?
It could be misleading for people using it...

> Size Estimator wrongly calculate Disk space in MB
> -
>
> Key: LUCENE-8329
> URL: https://issues.apache.org/jira/browse/LUCENE-8329
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: general/build
>Affects Versions: 7.3.1
>Reporter: Alessandro Benedetti
>Priority: Minor
> Attachments: LUCENE-8329.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The size estimator dev tool (dev-tools/size-estimator-lucene-solr.xls) 
> currently:
>  * Wrongly calculates disk size in MB (it actually shows GB)
>  * Doesn't clearly specify that the space needed by the optimize is FREE space
>  * Labels a value "Avg. Document Size (KB)" when it would more correctly be 
> "Avg. Document Field Size (KB)"
> The scope of this issue is just to fix these small mistakes.
>  Out of scope is any improvement to the tool (potentially separate Jira 
> issues will follow).
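The MB/GB slip described above boils down to dividing by the wrong power of 1024; a minimal standalone illustration (not taken from the spreadsheet itself, whose exact formulas are not shown here):

```java
// Minimal illustration of the unit slip the issue describes: dividing
// bytes by 1024^3 yields GB, not MB, even if the cell label says "MB".
public class SizeUnits {
    public static double bytesToMB(long bytes) {
        return bytes / (1024.0 * 1024.0);            // MB = bytes / 1024^2
    }

    public static double bytesToGB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);   // GB = bytes / 1024^3
    }

    public static void main(String[] args) {
        long bytes = 5L * 1024 * 1024 * 1024;        // a 5 GB index
        System.out.println(bytesToMB(bytes));        // 5120.0
        System.out.println(bytesToGB(bytes));        // 5.0
    }
}
```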






[jira] [Commented] (LUCENE-8347) BlendedInfixSuggester to handle multi term matches better

2020-01-23 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1700#comment-1700
 ] 

Alessandro Benedetti commented on LUCENE-8347:
--

[~mikemccand], [~dsmiley] I will take this back onto my radar, fix the 
conflicts and take a look.
LUCENE-8343 has been settled, so we should be able to work on this now!


> BlendedInfixSuggester to handle multi term matches better
> -
>
> Key: LUCENE-8347
> URL: https://issues.apache.org/jira/browse/LUCENE-8347
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: 7.3.1
>Reporter: Alessandro Benedetti
>Priority: Major
> Attachments: LUCENE-8347.patch, LUCENE-8347.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently the BlendedInfix suggester considers just the first match position 
> when scoring a suggestion.
> From the lucene-dev mailing list :
> "
> If I write more than one term in the query, let's say 
>  
> "Mini Bar Fridge" 
>  
> I would expect in the results something like (note that allTermsRequired=true 
> and the schema weight field always returns 1000)
>  
> - *Mini Bar Fridge* something
> - *Mini Bar Fridge* something else
> - *Mini Bar* something *Fridge*        
> - *Mini Bar* something else *Fridge*
> - *Mini* something *Bar Fridge*
> ...
>  
> Instead I see this: 
>  
> - *Mini Bar* something *Fridge*        
> - *Mini Bar* something else *Fridge*
> - *Mini Bar Fridge* something
> - *Mini Bar Fridge* something else
> - *Mini* something *Bar Fridge*
> ...
>  
> After having a look at the suggester code 
> (BlendedInfixSuggester.createCoefficient), I see that the component takes 
> into account only one position, which is the lowest position (among the three 
> matching terms) within the term vector ("mini" in the example above), so all 
> the suggestions above have the same weight.
> "
> Scope of this Jira issue is to improve the BlendedInfix to better manage 
> those scenarios.
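The gap described in the quote can be sketched as follows (hypothetical, simplified scoring; the real createCoefficient blending logic differs): using only the lowest match position gives contiguous and scattered matches the same coefficient, while considering every match position ranks the contiguous match first.

```java
import java.util.List;

// Simplified sketch of the scoring gap: a position-linear coefficient
// computed only from the lowest match position (current behaviour) versus
// one averaged over all match positions (a possible improvement).
public class BlendedCoefficientSketch {
    // position-linear blender: the coefficient decays with the position
    static double coefficient(int position) { return 1.0 / (1 + position); }

    public static double currentScore(List<Integer> matchPositions) {
        int first = matchPositions.stream().min(Integer::compare).orElse(0);
        return coefficient(first);          // only the lowest position counts
    }

    public static double proposedScore(List<Integer> matchPositions) {
        return matchPositions.stream()      // every match position counts
                .mapToDouble(BlendedCoefficientSketch::coefficient)
                .average().orElse(0.0);
    }

    public static void main(String[] args) {
        List<Integer> contiguous = List.of(0, 1, 2); // "Mini Bar Fridge something"
        List<Integer> scattered  = List.of(0, 1, 3); // "Mini Bar something Fridge"
        System.out.println(currentScore(contiguous) == currentScore(scattered));   // true
        System.out.println(proposedScore(contiguous) > proposedScore(scattered));  // true
    }
}
```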






[jira] [Commented] (SOLR-12238) Synonym Query Style Boost By Payload

2020-01-23 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021919#comment-17021919
 ] 

Alessandro Benedetti commented on SOLR-12238:
-

Hi [~dsmiley], you are right.
The GitHub pull request was linked in the Jira; I link it here:

[https://github.com/apache/lucene-solr/pull/357/files|https://github.com/apache/lucene-solr/pull/357/files]

Do you want me to split the functionality into two Jira issues (and split the 
code changes in two), or just migrate this to Lucene?
It has been a while, so I'll need to catch up with it; I am not sure the Lucene 
modifications are useful on their own.
What's the most common approach when the Lucene modifications are a 
prerequisite specifically for a Solr feature?









[jira] [Commented] (LUCENE-4499) Multi-word synonym filter (synonym expansion)

2020-01-17 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018233#comment-17018233
 ] 

Alessandro Benedetti commented on LUCENE-4499:
--

Hi [~Tagar], sorry for the late reply. I contributed a patch that is still 
waiting for a review:
[https://issues.apache.org/jira/browse/SOLR-12238|https://issues.apache.org/jira/browse/SOLR-12238]
It's a bit old, so it may require some porting effort, but it could help you.


> Multi-word synonym filter (synonym expansion)
> -
>
> Key: LUCENE-4499
> URL: https://issues.apache.org/jira/browse/LUCENE-4499
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/other
>Affects Versions: 4.1, 6.0
>Reporter: Roman Chyla
>Priority: Major
>  Labels: analysis, multi-word, synonyms
> Fix For: 6.0
>
> Attachments: LUCENE-4499.patch, LUCENE-4499.patch
>
>
> I apologize for bringing the multi-token synonym expansion up again. There is 
> an old, unresolved issue at LUCENE-1622 [1]
> While solving the problem for our needs [2], I discovered that the current 
> SolrSynonym parser (and the wonderful FTS) have almost everything to 
> satisfactorily handle both the query and index time synonym expansion. It 
> seems that people often need to use the synonym filter *slightly* differently 
> at indexing and query time.
> In our case, we must do different things during indexing and querying.
> Example sentence: Mirrors of the Hubble space telescope pointed at XA5
> This is what we need (comma marks position bump):
> indexing: mirrors,hubble|hubble space 
> telescope|hst,space,telescope,pointed,xa5|astroobject#5
> querying: +mirrors +(hubble space telescope | hst) +pointed 
> +(xa5|astroboject#5)
> This translated to following needs:
>   indexing time: 
> single-token synonyms => return only synonyms
> multi-token synonyms => return original tokens *AND* the synonyms
>   query time:
> single-token: return only synonyms (but preserve case)
> multi-token: return only synonyms
>  
> We need the original tokens for the proximity queries, if we indexed 'hubble 
> space telescope'
> as one token, we cannot search for 'hubble NEAR telescope'
> You may (not) be surprised, but Lucene already supports ALL of these 
> requirements. The patch is an attempt to state the problem differently. I am 
> not sure if it is the best option, however it works perfectly for our needs 
> and it seems it could work for general public too. Especially if the 
> SynonymFilterFactory had a preconfigured sets of SynonymMapBuilders - and 
> people would just choose what situation they use. Please look at the unittest.
> links:
> [1] https://issues.apache.org/jira/browse/LUCENE-1622
> [2] http://labs.adsabs.harvard.edu/trac/ads-invenio/ticket/158
> [3] seems to have similar request: 
> http://lucene.472066.n3.nabble.com/Proposal-Full-support-for-multi-word-synonyms-at-query-time-td4000522.html






[jira] [Commented] (SOLR-14029) Documentation for the handleSelect and QT is partially incorrect

2019-12-06 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989908#comment-16989908
 ] 

Alessandro Benedetti commented on SOLR-14029:
-

A tentative adjustment of the documentation is in the pull request.
Feel free to elaborate more if necessary.

> Documentation for the handleSelect and QT is partially incorrect
> 
>
> Key: SOLR-14029
> URL: https://issues.apache.org/jira/browse/SOLR-14029
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Affects Versions: 8.3.1
>Reporter: Alessandro Benedetti
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In this documentation page:
> https://lucene.apache.org/solr/guide/8_3/requestdispatcher-in-solrconfig.html
> we find this sentence:
> "The first configurable item is the handleSelect attribute on the 
> <requestDispatcher> element:
> *A value of "true" will route query requests to the parser defined with the 
> qt value.*"
> This bit is incorrect; it should be:
> *A value of "true" will route query requests to the parser defined with the 
> qt value if the /select request handler is not defined.*
> It seems a trivial change, but it gave me quite a headache to verify that qt 
> is ignored almost all the time.
> More info in: 
> [https://sease.io/2019/12/the-request-handlers-jungle-handleselect-and-qt-parameter.html|https://sease.io/2019/12/the-request-handlers-jungle-handleselect-and-qt-parameter.html]
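A minimal solrconfig.xml sketch of the behaviour being documented (an illustrative fragment with a hypothetical /browse handler, not a full dispatcher configuration):

```xml
<!-- With handleSelect="true" and no explicit /select request handler
     defined, a request such as /select?qt=/browse&q=foo is dispatched
     to the handler registered under the qt value; if a /select handler
     exists, qt is ignored. -->
<requestDispatcher handleSelect="true">
  <requestParsers enableRemoteStreaming="false"/>
</requestDispatcher>

<requestHandler name="/browse" class="solr.SearchHandler"/>
```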






[jira] [Created] (SOLR-14029) Documentation for the handleSelect and QT is partially incorrect

2019-12-06 Thread Alessandro Benedetti (Jira)
Alessandro Benedetti created SOLR-14029:
---

 Summary: Documentation for the handleSelect and QT is partially 
incorrect
 Key: SOLR-14029
 URL: https://issues.apache.org/jira/browse/SOLR-14029
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: documentation
Affects Versions: 8.3.1
Reporter: Alessandro Benedetti


In this documentation page:
https://lucene.apache.org/solr/guide/8_3/requestdispatcher-in-solrconfig.html

we find this sentence:

"The first configurable item is the handleSelect attribute on the 
<requestDispatcher> element:
*A value of "true" will route query requests to the parser defined with the qt 
value.*"

This bit is incorrect; it should be:
*A value of "true" will route query requests to the parser defined with the qt 
value if the /select request handler is not defined.*

It seems a trivial change, but it gave me quite a headache to verify that qt is 
ignored almost all the time.






[jira] [Updated] (SOLR-14029) Documentation for the handleSelect and QT is partially incorrect

2019-12-06 Thread Alessandro Benedetti (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated SOLR-14029:

Description: 
In this documentation page:
https://lucene.apache.org/solr/guide/8_3/requestdispatcher-in-solrconfig.html

we find this sentence:

"The first configurable item is the handleSelect attribute on the 
<requestDispatcher> element:
*A value of "true" will route query requests to the parser defined with the qt 
value.*"

This bit is incorrect; it should be:
*A value of "true" will route query requests to the parser defined with the qt 
value if the /select request handler is not defined.*

It seems a trivial change, but it gave me quite a headache to verify that qt is 
ignored almost all the time.

More info in: 
[https://sease.io/2019/12/the-request-handlers-jungle-handleselect-and-qt-parameter.html|https://sease.io/2019/12/the-request-handlers-jungle-handleselect-and-qt-parameter.html]

  was:
In this documentation page:
https://lucene.apache.org/solr/guide/8_3/requestdispatcher-in-solrconfig.html

we find this sentence:

"The first configurable item is the handleSelect attribute on the 
<requestDispatcher> element:
*A value of "true" will route query requests to the parser defined with the qt 
value.*"

This bit is incorrect; it should be:
*A value of "true" will route query requests to the parser defined with the qt 
value if the /select request handler is not defined.*

It seems a trivial change, but it gave me quite a headache to verify that qt is 
ignored almost all the time.








[jira] [Commented] (SOLR-11155) /analysis/field and /analysis/document requests should support points fields

2019-10-03 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943631#comment-16943631
 ] 

Alessandro Benedetti commented on SOLR-11155:
-

{code:java}
/** Given the readable value, return the term value that will match it. */
 public String readableToIndexed(String val) {
 return toInternal(val);
 }
{code}

In the case of PointField this will fail, inviting the caller to use 
toInternalByteRef instead; shouldn't this be fixed as well?
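One possible direction for the fix, as illustrative pseudocode only (the isPointField() guard is a hypothetical name, and this is not a patch against the actual Solr FieldType class):

```java
// Illustrative pseudocode: readableToIndexed could avoid the
// UnsupportedOperationException by branching to the byte-oriented
// conversion that PointField does support.
public String readableToIndexed(String val) {
  if (isPointField()) {                        // hypothetical guard
    // delegate to the conversion suggested by the error message
    return toInternalByteRef(val).utf8ToString();
  }
  return toInternal(val);                      // trie/text fields as before
}
```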

> /analysis/field and /analysis/document requests should support points fields
> 
>
> Key: SOLR-11155
> URL: https://issues.apache.org/jira/browse/SOLR-11155
> Project: Solr
>  Issue Type: Bug
>Reporter: Steven Rowe
>Assignee: Steven Rowe
>Priority: Blocker
>  Labels: numeric-tries-to-points
> Fix For: 7.0, 7.1, 8.0
>
> Attachments: SOLR-11155.patch, SOLR-11155.patch, SOLR-11155.patch
>
>
> The following added to FieldAnalysisRequestHandlerTest currently fails:
> {code:java}
>   @Test
>   public void testIntPoint() throws Exception {
> FieldAnalysisRequest request = new FieldAnalysisRequest();
> request.addFieldType("pint");
> request.setFieldValue("5");
> handler.handleAnalysisRequest(request, h.getCore().getLatestSchema());
>   }
> {code}
> as follows:
> {noformat}
>[junit4]   2> NOTE: reproduce with: ant test  
> -Dtestcase=FieldAnalysisRequestHandlerTest -Dtests.method=testIntPoint 
> -Dtests.seed=167CC259812871FB -Dtests.slow=true -Dtests.locale=fi-FI 
> -Dtests.timezone=Asia/Hebron -Dtests.asserts=true 
> -Dtests.file.encoding=US-ASCII
>[junit4] ERROR   0.01s | FieldAnalysisRequestHandlerTest.testIntPoint <<<
>[junit4]> Throwable #1: java.lang.UnsupportedOperationException: Can't 
> generate internal string in PointField. use PointField.toInternalByteRef
>[junit4]>  at 
> __randomizedtesting.SeedInfo.seed([167CC259812871FB:6BF651CEF8FF5B04]:0)
>[junit4]>  at 
> org.apache.solr.schema.PointField.toInternal(PointField.java:187)
>[junit4]>  at 
> org.apache.solr.schema.FieldType$DefaultAnalyzer$1.incrementToken(FieldType.java:488)
>[junit4]>  at 
> org.apache.solr.handler.AnalysisRequestHandlerBase.analyzeTokenStream(AnalysisRequestHandlerBase.java:188)
>[junit4]>  at 
> org.apache.solr.handler.AnalysisRequestHandlerBase.analyzeValue(AnalysisRequestHandlerBase.java:102)
>[junit4]>  at 
> org.apache.solr.handler.FieldAnalysisRequestHandler.analyzeValues(FieldAnalysisRequestHandler.java:225)
>[junit4]>  at 
> org.apache.solr.handler.FieldAnalysisRequestHandler.handleAnalysisRequest(FieldAnalysisRequestHandler.java:186)
>[junit4]>  at 
> org.apache.solr.handler.FieldAnalysisRequestHandlerTest.testIntPoint(FieldAnalysisRequestHandlerTest.java:435)
> {noformat}
> If points fields aren't supported by the FieldAnalysisRequestHandler, then 
> this should be directly stated in the error message, which should be a 4XX 
> error rather than a 5XX error.


