Re: Connecting Lucene with ChatGPT Retrieval Plugin

2023-05-10 Thread Gus Heck
Do you anticipate that the vector engine would be changed in a way that
intentionally precluded larger vectors? I would think that
the ability to support larger vectors should be a key criterion for any
changes to be made. Certainly, if optimizations at
specific sizes (due to a power-of-2 size or some other numerical coincidence)
are found in the future, we should have ways of picking them up if people use
the beneficial size, but I don't understand the idea that we would support
a change to the engine that would preclude larger vectors in the long run.
It makes great sense to have a default limit, because it's important to
communicate that "beyond this point we haven't tested, we don't know what
happens, and you are on your own". But forcing a code fork for folks to do
that testing only creates a barrier if they find something useful that they
want to contribute back...

FWIW, on the proposal's thread I like the configurability option.

On Tue, May 9, 2023 at 12:49 PM Bruno Roustant wrote:

> I agree with Robert Muir that increasing the 1024 limit where it currently
> sits in FloatVectorValues or ByteVectorValues would bind the API; we
> could not decrease it afterwards, even if we needed to change the vector engine.
>
> Would it be possible to move the limit definition to an HNSW-specific
> implementation, where it would only bind HNSW?
> I don't know this area of the code well. It seems to me the FloatVectorValues
> implementation is unfortunately not HNSW-specific. Is this on purpose? We
> should be able to replace the vector engine, no?
>
> On Sat, 6 May 2023 at 22:44, Michael Wechner wrote:
>
>> there is already a pull request for Elasticsearch which also
>> mentions the max size 1024
>>
>> https://github.com/openai/chatgpt-retrieval-plugin/pull/83
>>
>>
>>
>> On 06.05.23 at 19:00, Michael Wechner wrote:
>> > Hi all,
>> >
>> > I recently set up the ChatGPT retrieval plugin locally
>> >
>> > https://github.com/openai/chatgpt-retrieval-plugin
>> >
>> > I think it would be nice to consider submitting a Lucene implementation
>> > for this plugin:
>> >
>> > https://github.com/openai/chatgpt-retrieval-plugin#future-directions
>> >
>> > By default the plugin uses OpenAI's model "text-embedding-ada-002"
>> > with 1536 dimensions
>> >
>> > https://openai.com/blog/new-and-improved-embedding-model
>> >
>> > which means one won't be able to use it out-of-the-box with Lucene.
>> >
>> > Similar request here
>> >
>> >
>> https://learn.microsoft.com/en-us/answers/questions/1192796/open-ai-text-embedding-dimensions
>> >
>> >
>> > I understand we just recently had a lengthy discussion about
>> > increasing the max dimension, and whatever one thinks of OpenAI, the fact
>> > is that it has a huge impact, and I think it would be nice if Lucene
>> > could be part of this "revolution". All we have to do is increase the
>> > limit from 1024 to 1536, or even 2048 for example.
>> >
>> > Since the performance seems to be linear with the vector dimension,
>> > several members have done performance tests successfully, and 1024
>> > seems to have been chosen as the max dimension quite arbitrarily in the
>> > first place, I think it should not be a problem to increase the max
>> > dimension by a factor of 1.5 or 2.
>> >
>> > WDYT?
>> >
>> > Thanks
>> >
>> > Michael
>> >
>> >
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>> >
>>
>>
>>
>>

-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)


Re: Connecting Lucene with ChatGPT Retrieval Plugin

2023-05-10 Thread Jonathan Ellis
I did track down a weird bug I was seeing to our cosine similarity
returning NaN with high-dimension vectors.  The fix is here:
https://github.com/apache/lucene/pull/12281
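The failure mode is easy to reproduce outside Lucene. Below is a toy sketch, not Lucene's actual VectorUtil code, illustrating why naive 32-bit float accumulation over a high-dimension vector with large components can yield NaN: the dot product overflows to Infinity, and Infinity divided by Infinity is NaN.

```java
public class CosineNaNDemo {
    static float cosine(float[] a, float[] b) {
        float dot = 0f, normA = 0f, normB = 0f;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];       // can overflow to Float.POSITIVE_INFINITY
            normA += a[i] * a[i];     // overflows too for large components
            normB += b[i] * b[i];
        }
        // Infinity / Infinity evaluates to NaN
        return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
    }

    public static void main(String[] args) {
        float[] v = new float[1536];
        java.util.Arrays.fill(v, 1e19f); // each product is 1e38, near Float.MAX_VALUE
        System.out.println(cosine(v, v)); // prints NaN
    }
}
```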

On Tue, May 9, 2023 at 12:15 PM Jonathan Ellis wrote:

> I'm adding Lucene HNSW to Cassandra for vector search.  One of my test
> harnesses loads 50k OpenAI embeddings.  It works as expected; as someone
> pointed out, performance should be linear wrt vector size, and that is what
> I see.  I would not be afraid of increasing the max size.
>
> In parallel, Cassandra is also adding numerical indexes using Lucene's k-d
> tree.  We definitely expect people to want to compose the two (topK vector
> matches that also satisfy some other predicates).
>
> But I agree that classic term-based relevance queries are probably less
> useful combined with vector search.
>
>
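Composing the two indexes boils down to filtered top-K retrieval; in Lucene 9.x this is roughly what the KnnFloatVectorQuery constructor that accepts a pre-filter Query provides. As a toy illustration of the semantics only, with no Lucene types (all names here are made up for the sketch):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.IntPredicate;

public class FilteredTopK {
    // Return the ids of the k highest-scoring docs that also satisfy the
    // predicate, mimicking "topK vector matches that also satisfy some
    // other predicate" (e.g. a k-d tree range match).
    static List<Integer> topK(float[] scores, IntPredicate filter, int k) {
        PriorityQueue<Integer> heap =
            new PriorityQueue<>(Comparator.comparingDouble(id -> scores[id])); // min-heap by score
        for (int id = 0; id < scores.length; id++) {
            if (!filter.test(id)) continue;   // predicate filter first
            heap.offer(id);
            if (heap.size() > k) heap.poll(); // keep only the k best
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort((a, b) -> Float.compare(scores[b], scores[a])); // best first
        return result;
    }
}
```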
> On Tue, May 9, 2023 at 11:59 AM Jun Luo wrote:
>
>> The PR mentioned an Elasticsearch PR
>> that increased the
>> dims limit to 2048 in Elasticsearch.
>>
>> Curious how you use Lucene's KNN search. Lucene's KNN supports one vector
>> per document, but usually multiple vectors are needed to cover a document's
>> content. We would have to split the document content into chunks and create
>> one Lucene document per document chunk.
>>
>> The ChatGPT plugin stores the chunk text directly in the underlying vector
>> DB. If there are lots of documents, will it be a concern to store the full
>> document content in Lucene? In the traditional inverted-index use case, is
>> it common to store the full document content in Lucene?
>>
>> Another question: if you use Lucene as a vector DB, do you still need the
>> inverted index? I'm wondering what the use case would be for combining an
>> inverted index with a vector index. If we don't need the inverted index,
>> would it be better to use another vector DB? For example, PostgreSQL also
>> added vector support recently.
>>
>> Thanks,
>> Jun
>>
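The chunk-per-document approach described above can be sketched as follows. This is a minimal illustration only: the chunk size and overlap are arbitrary placeholders, and the embedding and indexing calls are omitted.

```java
import java.util.ArrayList;
import java.util.List;

public class Chunker {
    // Split content into fixed-size chunks with some overlap, so that each
    // chunk can become its own Lucene document carrying a single vector.
    // Requires chunkSize > overlap, otherwise the loop would not advance.
    static List<String> chunk(String content, int chunkSize, int overlap) {
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < content.length(); start += chunkSize - overlap) {
            int end = Math.min(start + chunkSize, content.length());
            chunks.add(content.substring(start, end));
            if (end == content.length()) break;
        }
        return chunks;
    }
}
```

Each chunk would then be indexed as its own document, e.g. with a KnnFloatVectorField holding its embedding plus a StoredField holding the chunk text (or just a reference to it), so Lucene only ever stores chunk-level text rather than whole documents.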
>> On Sat, May 6, 2023 at 1:44 PM Michael Wechner wrote:
>>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced


Re: New branch and feature freeze for Lucene 9.6.0

2023-05-10 Thread Alan Woodward
Thanks Ishan, it turns out the error was between chair and keyboard: I'd told the
wizard to use the Gradle Java plugin to sign things when I should have been
using gpg.
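For anyone who hits the same wall: the Gradle signing plugin can delegate to a local gpg executable (and its agent) rather than reading a secring file. A sketch of the relevant properties follows; the key ID is a placeholder, and this path only takes effect if the build invokes useGpgCmd() in its signing configuration, so Lucene's build wiring may differ.

```properties
# ~/.gradle/gradle.properties: delegate signing to the gpg executable
# instead of the deprecated secring.gpg file
signing.gnupg.executable=gpg
signing.gnupg.keyName=<YOUR_KEY_ID>
```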

> On 2 May 2023, at 14:19, Ishan Chattopadhyaya wrote:
> 
> Don't remember the specifics, but I ran into GPG issues during Solr 9.1.0 
> release. The fix for me was https://github.com/apache/solr/pull/1125, but I 
> don't know if this is the same problem or if it is applicable in Lucene's 
> case.
> 
> On Tue, 2 May 2023 at 18:27, Alan Woodward wrote:
>> I am fighting with gradle and GPG yet again… Gradle fails when trying to 
>> sign artefacts with the message "Cannot perform signing task 
>> ':lucene:distribution:signReleaseArchives' because it has no configured 
>> signatory". I have GPG configured in ~/.gradle/gradle.properties as follows:
>> 
>> org.gradle.caching=true
>> signing.keyId=
>> signing.secretKeyRingFile=/Users/romseygeek/.gnupg/secring.gpg
>> signing.gnupg.executable=gpg
>> 
>> This worked last time I did a release.  Does anybody know if anything has 
>> changed in gradle that means I need to change the properties file, or have 
>> any other ideas?
>> 
>> > On 27 Apr 2023, at 10:54, Alan Woodward wrote:
>> > 
>> > I have started a release note here: 
>> > https://cwiki.apache.org/confluence/display/LUCENE/Release+Notes+9.6
>> > 
>> >> On 27 Apr 2023, at 09:45, Alan Woodward wrote:
>> >> 
>> >> I have successfully wrestled Jenkins into submission, and there are now 
>> >> 9.6 jobs for Artifacts, Check and NightlyTests.
>> >> 
>> >>> On 26 Apr 2023, at 16:53, Alan Woodward wrote:
>> >>> 
>> >>> NOTICE:
>> >>> 
>> >>> Branch branch_9_6 has been cut and versions updated to 9.7 on the 
>> >>> stable branch.
>> >>> 
>> >>> Please observe the normal rules:
>> >>> 
>> >>> * No new features may be committed to the branch.
>> >>> * Documentation patches, build patches and serious bug fixes may be
>> >>> committed to the branch. However, you should submit all patches you
>> >>> want to commit as pull requests first to give others the chance to review
>> >>> and possibly vote against them. Keep in mind that it is our
>> >>> main intention to keep the branch as stable as possible.
>> >>> * All patches that are intended for the branch should first be committed
>> >>> to the unstable branch, merged into the stable branch, and then into
>> >>> the current release branch.
>> >>> * Normal unstable and stable branch development may continue as usual.
>> >>> However, if you plan to commit a big change to the unstable branch
>> >>> while the branch feature freeze is in effect, think twice: can't the
>> >>> addition wait a couple more days? Merges of bug fixes into the branch
>> >>> may become more difficult.
>> >>> * Only Github issues with Milestone 9.6
>> >>> and priority "Blocker" will delay a release candidate build.
>> >>> 
>> >>> 
>> >>> I am struggling to find the lucene Jenkins jobs on the new apache build 
>> >>> server at https://jenkins-ccos.apache.org/ - if anybody has any hints as 
>> >>> to how to navigate the helpful new interface with a non-functional 
>> >>> search box, I would be very grateful…
>> >>> 
>> >>> It’s a holiday weekend coming up in the UK, so my plan is to give 
>> >>> Jenkins a few days to chew things over (once I actually get the jobs 
>> >>> running) and then build a RC on Tuesday 2nd May.
>> >> 
>> > 
>> 
>> 
>> 



[ANNOUNCE] Apache Lucene 9.6.0 released

2023-05-10 Thread Alan Woodward
The Lucene PMC is pleased to announce the release of Apache Lucene 9.6.0.

Apache Lucene is a high-performance, full-featured search engine library 
written entirely in Java. It is a technology suitable for nearly any 
application that requires structured search, full-text search, faceting, 
nearest-neighbor search across high-dimensionality vectors, spell correction or 
query suggestions.

This release contains numerous bug fixes, optimizations, and improvements, some 
of which are highlighted below. The release is available for immediate download 
at:

 

### Lucene 9.6.0 Release Highlights:

* Introduce a new KeywordField for simple and efficient filtering, sorting and 
faceting.
* Add support for Java 20 foreign memory API. If exactly Java 19 or 20 is used, 
MMapDirectory will mmap Lucene indexes in chunks of 16 GiB (instead of 1 GiB) 
and indexes closed while queries are running can no longer crash the JVM.
* Improved performance for TermInSetQuery, PrefixQuery, WildcardQuery and 
TermRangeQuery
* Lower memory usage for BloomFilteringPostingsFormat
* Faster merges for HNSW indexes
* Improvements to concurrent indexing throughput under heavy load
* Correct equals implementation in SynonymQuery
* 'explain' is now implemented on TermAutomatonQuery

Please read CHANGES.txt for a full list of new features and changes:

 



Re: Dimensions Limit for KNN vectors - Next Steps

2023-05-10 Thread Bruno Roustant
*Proposed option:* Move the max dimension limit down to an HNSW-specific
implementation. Once there, this limit would not bind any other
potential vector engine alternative or evolution.

*Motivation:* There seem to be contradictory performance interpretations
of the current HNSW implementation. Some consider its performance OK,
some do not, and it depends on the target data set and use case. Increasing
the max dimension limit where it currently sits (in the top-level
FloatVectorValues) would not allow potential alternatives (e.g. for other
use cases) to be based on a lower limit.
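As a toy model of the proposed layering (all class names below are hypothetical, not Lucene's actual classes), the dimension cap would live with the vector engine rather than with the field-level API, so an alternative engine could declare a different limit without any API change:

```java
// Hypothetical sketch of "the limit binds only the engine, not the API".
interface VectorEngine {
    int maxDimensions();
}

class HnswEngine implements VectorEngine {
    public int maxDimensions() { return 1024; } // HNSW-specific cap
}

class AlternativeEngine implements VectorEngine {
    public int maxDimensions() { return 4096; } // a future engine could allow more
}

class VectorField {
    // The field layer only consults whichever engine is plugged in.
    static void checkDimension(int dim, VectorEngine engine) {
        if (dim > engine.maxDimensions()) {
            throw new IllegalArgumentException(
                "vector dimension " + dim + " exceeds engine limit "
                    + engine.maxDimensions());
        }
    }
}
```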

Bruno