Re: Payloads for each term

2022-01-13 Thread Michael Sokolov
Oh interesting! I did not know about this FeatureField (the link was to the old repo, which is now gone; https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/document/FeatureField.java worked for me). On Wed, Nov 11, 2020 at 4:37 PM Mayya Sharipova wrote: > > For sparse vectors,

Re: Payloads for each term

2020-11-17 Thread David Smiley
STUniformSplitPostingsFormat is used in production at a massive scale, helping reduce overall memory needs a ton. I highly recommend it :-) notwithstanding the main caveat of any non-default format: "lucene.experimental" can be applied for different reasons -- the main one is

Re: Payloads for each term

2020-11-16 Thread Ankur Goel
Thanks, everyone, for the helpful suggestions. @Mayya In my use case, these features are not term-independent, which is the primary use case for FeatureField as per the documentation. The FeatureField solution stores features as terms and values as term frequencies. This means that it relies on the

Re: Payloads for each term

2020-11-11 Thread Mayya Sharipova
For sparse vectors, we found that Lucene's FeatureField could also be useful. It stores features as terms and feature values as term frequencies, and provides several convenient
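To make the "feature values as term frequencies" idea concrete, here is a minimal, self-contained sketch of how a positive float can be squeezed into an integer frequency and read back. It does not call Lucene. The class name FeatureFreqSketch and the 15-bit shift are assumptions chosen for illustration (the shift reflects my recollection of how FeatureField truncates the float, not something stated in this thread), so treat it as an approximation of the technique rather than the library's code.

```java
public final class FeatureFreqSketch {
    // Assumption: drop the 15 low mantissa bits, keeping the exponent and
    // the top 8 mantissa bits, so the result fits in a small positive int.
    static final int SHIFT = 15;

    // Encode a positive, finite feature value as an integer "term frequency".
    static int encode(float value) {
        if (!(value > 0f) || Float.isInfinite(value)) {
            throw new IllegalArgumentException("value must be positive and finite: " + value);
        }
        int freq = Float.floatToIntBits(value) >>> SHIFT;
        if (freq == 0) {
            throw new IllegalArgumentException("value too small to encode: " + value);
        }
        return freq;
    }

    // Decode a frequency back to a float; the low mantissa bits come back as zero.
    static float decode(int freq) {
        return Float.intBitsToFloat(freq << SHIFT);
    }

    public static void main(String[] args) {
        float original = 3.14159f;
        int freq = encode(original);
        System.out.println(original + " -> freq " + freq + " -> " + decode(freq));
    }
}
```

Because positive floats sort the same way as their bit patterns, larger feature values map to larger frequencies, and the round trip loses at most the truncated mantissa bits (a relative error below 1/256 under the shift assumed here).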

Re: Payloads for each term

2020-11-06 Thread Michael McCandless
Also, be aware that recent Lucene versions enabled compression for BinaryDocValues fields, which might hurt performance of your second solution. This compression is not yet something you can easily turn off, but there are ongoing discussions/PRs about how to make it more easily configurable for

Re: Payloads for each term

2020-11-06 Thread Michael McCandless
In addition to payloads having kind of high-ish overhead (slow down indexing, do not compress very well I think, and slow down search as you must pull positions), they are also sort of a forced fit for your use case, right? Because a payload in Lucene is per-term-position, whereas you really

Re: Payloads for each term

2020-10-26 Thread Bruno Roustant
Hi Ankur, indeed, payloads are the standard way to solve this problem. For light queries with a few top N results, that should be efficient. For multi-term queries, that could become penalizing if you need to access the payloads of too many terms. Also, there is an experimental PostingsFormat called

Payloads for each term

2020-10-22 Thread Ankur Goel
Hi Lucene Devs, I have a need to store a sparse feature vector on a per term basis. The total number of possible dimensions is small (~50) and known at indexing time. The feature values will be used in scoring along with corpus statistics. It looks like payloads
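For readers following the thread, a sparse feature vector with a small, known set of dimensions (~50) can be packed into a compact byte array, which is the kind of blob one could then hand to a payload or a binary doc values field. The sketch below is only an illustration under stated assumptions: the class name SparseVectorCodec and the layout (one byte for the dimension id, four bytes for the float value) are hypothetical and not taken from the thread or from Lucene.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.TreeMap;

public final class SparseVectorCodec {
    // Hypothetical layout: 1 byte dimension id + 4 byte float per non-zero entry.
    private static final int ENTRY_BYTES = 5;

    // Pack the non-zero entries, sorted by dimension id, into a byte array.
    static byte[] encode(Map<Integer, Float> features) {
        ByteBuffer buf = ByteBuffer.allocate(features.size() * ENTRY_BYTES);
        for (Map.Entry<Integer, Float> e : new TreeMap<>(features).entrySet()) {
            int dim = e.getKey();
            if (dim < 0 || dim > 255) {
                throw new IllegalArgumentException("dimension id must fit in one byte: " + dim);
            }
            buf.put((byte) dim);
            buf.putFloat(e.getValue());
        }
        return buf.array();
    }

    // Unpack a byte array produced by encode back into dimension -> value.
    static Map<Integer, Float> decode(byte[] bytes) {
        Map<Integer, Float> out = new TreeMap<>();
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        while (buf.remaining() >= ENTRY_BYTES) {
            int dim = buf.get() & 0xFF; // read the id back as an unsigned byte
            out.put(dim, buf.getFloat());
        }
        return out;
    }
}
```

With this layout a vector with k non-zero dimensions costs 5k bytes, so the per-term cost scales with the number of features actually present rather than with the ~50 possible dimensions.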