[GitHub] [lucene] rmuir commented on issue #11870: Create a Markdown based documentation

2023-01-05 Thread GitBox


rmuir commented on issue #11870:
URL: https://github.com/apache/lucene/issues/11870#issuecomment-1373155758

   The various overview.html files might even be the easiest place to think 
about how markdown could work, rather than the package summaries.
   
   These are currently maintained as html files, and passed to the javadoc 
command with `-overview file.html`. Maybe they could be maintained as README.md 
files instead that get preprocessed to overview.html? 
   
   I like the idea that browsing lucene/core/src/java on GitHub would then 
show the overview automatically, but all the links to classes are going to 
be broken without javadoc support, so I'm not sure what value we get from 
markdown over just keeping them as html. Plus, the additional .md->.html 
indirection would add some complexity over the current files. But I guess 
it might be easier to contribute to?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on issue #11870: Create a Markdown based documentation

2023-01-05 Thread GitBox


rmuir commented on issue #11870:
URL: https://github.com/apache/lucene/issues/11870#issuecomment-1373144282

   There is some markdown processed in this way for the release: 
https://lucene.apache.org/core/9_4_0/
   
   Source code is here: 
https://github.com/apache/lucene/blob/main/lucene/documentation/src/markdown/index.template.md
   
   I tend to agree that the long package summaries might be better as markdown, 
since these package summaries don't necessarily get a lot of visibility via 
tools like IDEs. The same goes for the module overviews such as 
https://lucene.apache.org/core/9_4_2/core/index.html . Both of these tend to be 
the places with the more verbose explanations.
   
   But I also agree with some of Dawid's thoughts too.
   * if these summaries/overview docs are no longer javadoc but instead 
markdown, it would be better to allow these to be organized per-module rather 
than having everything in `lucene/documentation`
   * a little concerned about navigation: having the content in javadocs makes 
this easy: just click "Package" or "Overview". If we mix markdown and javadocs, 
I don't know what it would feel like when browsing through it.
   * maintenance is a serious concern. One thing that really helps is that we 
run some serious javadocs linting and broken-link detection across all of our 
docs, which fails the build if things are out of date. We'd at least want to 
make sure we still do broken-link detection across any markdown.
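
A minimal sketch of what broken-link detection over markdown could look like, 
assuming only inline `[text](target)` links and only relative file targets are 
checked. The class and method names are hypothetical and this is not Lucene's 
existing link checker; it only illustrates the kind of check that could fail 
a build.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Lucene build.
final class MarkdownLinkCheck {
  // Matches inline links such as [text](target); reference-style links are ignored.
  private static final Pattern INLINE_LINK =
      Pattern.compile("\\[[^\\]]*\\]\\(([^)\\s]+)\\)");

  /** Returns the relative link targets in the markdown that do not exist under baseDir. */
  static List<String> brokenRelativeLinks(String markdown, Path baseDir) {
    List<String> broken = new ArrayList<>();
    Matcher m = INLINE_LINK.matcher(markdown);
    while (m.find()) {
      String target = m.group(1);
      // Skip absolute URLs, mail links and in-page anchors; only local files are checked.
      if (target.contains("://") || target.startsWith("#") || target.startsWith("mailto:")) {
        continue;
      }
      // Drop any "#fragment" suffix before resolving the file on disk.
      int hash = target.indexOf('#');
      String file = hash >= 0 ? target.substring(0, hash) : target;
      if (!Files.exists(baseDir.resolve(file).normalize())) {
        broken.add(target);
      }
    }
    return broken;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("mdcheck");
    Files.writeString(dir.resolve("present.html"), "<html></html>");
    String md =
        "See [ok](present.html#top), [gone](missing.html) and [site](https://lucene.apache.org/).";
    List<String> broken = brokenRelativeLinks(md, dir);
    if (!broken.isEmpty()) {
      // A real build task would fail here instead of just reporting.
      System.err.println("Broken links: " + broken);
    }
  }
}
```

Links into generated javadoc would need the javadoc output directory as the 
base, which is one reason such a check would likely have to run after the 
documentation is assembled.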
   





[GitHub] [lucene] jmazanec15 commented on a diff in pull request #12050: Reuse HNSW graph for intialization during merge

2023-01-05 Thread GitBox


jmazanec15 commented on code in PR #12050:
URL: https://github.com/apache/lucene/pull/12050#discussion_r1062966074


##
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java:
##
@@ -94,36 +93,83 @@ public int size() {
   }
 
   /**
-   * Add node on the given level
+   * Add node on the given level. Nodes can be inserted out of order, but it 
requires that the nodes

Review Comment:
   > but still because in L156 we need to copy the rest of array again and 
again as long as that is a non-appending action
   
   Right, this could be expensive for out of order insertion. I can try 
switching the nodeByLevel int array to a TreeSet and compare performance to 
https://github.com/apache/lucene/issues/11354.
   
   One complication with this approach is that the NodesIterator expects an int 
array: 
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java#L134.
 Given this is a public interface, we might need to either convert the TreeSet 
to an int array every time 
[getNodesOnLevel](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java#L165)
 gets called, or alter the NodesIterator interface to support both an int array 
and an Iterator produced from the TreeSet.
   
   @zhaih What do you think of this approach? Is there a better way to do this?
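
A minimal sketch of the first option, assuming a `TreeSet<Integer>` per level 
that is copied into a sorted `int[]` whenever an array is requested. The class 
`SortedLevelNodes` and its methods are hypothetical and are not part of 
`OnHeapHnswGraph`; the sketch only illustrates the trade-off being discussed.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical stand-in for the per-level node storage discussed above.
final class SortedLevelNodes {
  // Node ids on one graph level, kept sorted regardless of insertion order.
  private final TreeSet<Integer> nodes = new TreeSet<>();

  /** Inserts a node in O(log n); returns false if it was already present. */
  boolean add(int node) {
    return nodes.add(node);
  }

  /** Copies the set into a sorted int[]; this costs O(n) on every call. */
  int[] toSortedArray() {
    int[] out = new int[nodes.size()];
    int i = 0;
    for (int node : nodes) {
      out[i++] = node; // unboxes each Integer
    }
    return out;
  }

  public static void main(String[] args) {
    SortedLevelNodes level = new SortedLevelNodes();
    // Out-of-order insertion does not shift any array contents.
    level.add(7);
    level.add(2);
    level.add(5);
    System.out.println(Arrays.toString(level.toSortedArray()));
  }
}
```

Insertion avoids the repeated array copy, but the set boxes every node id and 
each conversion to `int[]` is linear, so whether this wins would depend on how 
often the array form is requested relative to out-of-order inserts.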






[GitHub] [lucene] jmazanec15 commented on issue #11354: Reuse HNSW graphs when merging segments? [LUCENE-10318]

2023-01-05 Thread GitBox


jmazanec15 commented on issue #11354:
URL: https://github.com/apache/lucene/issues/11354#issuecomment-1372817284

   @msokolov here are the results re-using a single index for each experiment. 
Overall, there is still some variability, but it seems like there is less. For 
the 10K results, it appears that control performed better, although its recall 
is slightly worse. 
   
   10K
   | Exper. | time to merge (ms) | QPS | Recall | Size vec (MB) | Size vem (KB) | Size vex (MB) |
   | --- | --- | --- | --- | --- | --- | --- |
   | Control 1 | 696096 | 684 | 0.979 | 512.0001 | 70.172 | 60.62953 |
   | Control 2 | 695400 | 724 | 0.979 | 512.0001 | 70.172 | 60.62953 |
   | Control 3 | 710602 | 699 | 0.979 | 512.0001 | 70.172 | 60.62953 |
   | Test 1 | 736711 | 649 | 0.98 | 512.0001 | 70.129 | 60.62525 |
   | Test 2 | 742799 | 751 | 0.98 | 512.0001 | 70.129 | 60.62525 |
   | Test 3 | 742263 | 746 | 0.98 | 512.0001 | 70.129 | 60.62525 |
   
   
   100K
   | Exper. | time to merge (ms) | QPS | Recall | Size vec (MB) | Size vem (KB) | Size vex (MB) |
   | --- | --- | --- | --- | --- | --- | --- |
   | Control 1 | 714349 | 689 | 0.981 | 512.0001 | 70.172 | 60.44963 |
   | Control 2 | 703428 | 763 | 0.981 | 512.0001 | 70.172 | 60.44963 |
   | Control 3 | 721943 | 666 | 0.981 | 512.0001 | 70.172 | 60.44963 |
   | Test 1 | 669922 | 729 | 0.981 | 512.0001 | 70.26 | 60.45246 |
   | Test 2 | 682579 | 729 | 0.981 | 512.0001 | 70.26 | 60.45246 |
   | Test 3 | 659374 | 724 | 0.981 | 512.0001 | 70.26 | 60.45246 |
   
   500K
   | Exper. | time to merge (ms) | QPS | Recall | Size vec (MB) | Size vem (KB) | Size vex (MB) |
   | --- | --- | --- | --- | --- | --- | --- |
   | Control 1 | 674606 | 751 | 0.98 | 512.0001 | 70.172 | 59.69535 |
   | Control 2 | 657207 | 699 | 0.98 | 512.0001 | 70.172 | 59.69535 |
   | Control 3 | 664536 | 694 | 0.98 | 512.0001 | 70.172 | 59.69535 |
   | Test 1 | 381532 | 793 | 0.98 | 512.0001 | 70.256 | 59.69746 |
   | Test 2 | 371540 | 793 | 0.98 | 512.0001 | 70.256 | 59.69746 |
   | Test 3 | 382440 | 800 | 0.98 | 512.0001 | 70.256 | 59.69746 |
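
To compare the runs despite the variability, one option is to average the 
three merge times per group and look at the relative change. The sketch below 
does that for the "time to merge (ms)" column of the tables above; the class 
and method names are hypothetical, and three runs per group is too few to say 
anything about statistical significance.

```java
// Hypothetical helper for summarizing the merge-time column of the tables above.
final class MergeTimeSummary {
  /** Arithmetic mean of the given merge times in milliseconds. */
  static double mean(long... values) {
    if (values.length == 0) {
      throw new IllegalArgumentException("need at least one value");
    }
    long sum = 0;
    for (long v : values) {
      sum += v;
    }
    return (double) sum / values.length;
  }

  /** Relative change of test versus control; negative means test merged faster. */
  static double relativeChange(double control, double test) {
    return (test - control) / control;
  }

  private static void report(String label, double control, double test) {
    System.out.printf(
        "%s: control=%.0f ms, test=%.0f ms, change=%.1f%%%n",
        label, control, test, 100.0 * relativeChange(control, test));
  }

  public static void main(String[] args) {
    // Values copied from the "time to merge (ms)" column above.
    report("10K", mean(696096, 695400, 710602), mean(736711, 742799, 742263));
    report("100K", mean(714349, 703428, 721943), mean(669922, 682579, 659374));
    report("500K", mean(674606, 657207, 664536), mean(381532, 371540, 382440));
  }
}
```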
   
   





[GitHub] [lucene] dweiss commented on issue #11870: Create a Markdown based documentation

2023-01-05 Thread GitBox


dweiss commented on issue #11870:
URL: https://github.com/apache/lucene/issues/11870#issuecomment-1372325036

   What I meant is that this documentation should really go into the 
modules/classes where it belongs, so it can be updated and maintained together 
with the code. I honestly don't believe a separately written manual will be 
kept in sync with the code. I am with you on many libraries having excellent 
documentation - it'd be great to have it. The truth is, it's a huge effort not 
many people will have time for (or the interest in doing, compared to writing 
new features or tinkering with the code). Sorry to sound so pessimistic - 
you're welcome to do anything you like, of course - that's the beauty of open 
source. 
   
   Also, perhaps ChatGPT can emit this automatically in a few months if you 
point it at the source code?... :)





[GitHub] [lucene] benwtrent commented on pull request #12064: Create new KnnByteVectorField and KnnVectorsReader#getByteVectorValues(String)

2023-01-05 Thread GitBox


benwtrent commented on PR #12064:
URL: https://github.com/apache/lucene/pull/12064#issuecomment-1372294789

   Digging into it more, removing `AbstractVectorValues` will add a fair bit of 
extra code to the KnnVectorWriters and testing (though testing is a lesser 
concern I suppose). 
   
   My thoughts on keeping it are that eventually, we will want to add support 
for binary vectors (to be used specifically with hamming distance) and 
half-float (or float16; admittedly, this one may wait until the JVM has better 
float16 support).
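
For context on the binary case, a minimal sketch of hamming distance over 
bit-packed vectors follows, assuming each vector is stored as a `byte[]` with 
eight dimensions per byte. `HammingSketch` is a hypothetical name and this is 
not an existing Lucene similarity function.

```java
// Hypothetical illustration of hamming distance on bit-packed binary vectors.
final class HammingSketch {
  /** Counts the bit positions in which the two equal-length vectors differ. */
  static int hammingDistance(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector lengths differ");
    }
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
      // XOR marks differing bits; the mask removes sign-extension from byte-to-int promotion.
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

  public static void main(String[] args) {
    byte[] x = {0b1010, (byte) 0xFF};
    byte[] y = {0b0110, 0};
    System.out.println(hammingDistance(x, y));
  }
}
```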
   
   I am not sure there are other vector encodings we will want to support, but 
I can see Lucene supporting at least these 4 (including our byte & float32) 
eventually.
   
   There is already a fair bit of duplication. If the prevailing opinion is to 
completely remove `AbstractVectorValues` and make the writers handle 
individual vector encodings (instead of relying on the underlying BytesRef), I 
will comply.
   
   What say you @rmuir && @jpountz ?





[GitHub] [lucene] jebnix commented on issue #11870: Create a Markdown based documentation

2023-01-05 Thread GitBox


jebnix commented on issue #11870:
URL: https://github.com/apache/lucene/issues/11870#issuecomment-1372225809

   @dweiss But currently, it's very hard and unintuitive to learn Lucene as a 
new user. Most libraries these days have a docusaurus-like engine that 
generates a pretty nice and intuitive website where users can find all of the 
beginner-to-intermediate information they need about using the library, all in 
one unified place. That also makes it much easier for future contributors to 
find the docs. Currently, the docs are spread all over the Lucene code base. 
That's nice when you dig in, but it makes it really hard for new users to find 
out where the docs are.

