Good morning! Is it possible that you are mixing up payloads and stored fields? The latter are not indexed and can only be used for the top n results. Maybe we're talking about different things.
With the question of how to include the similarities, I was actually asking how to include scores, say an LLR value, in an index. Do you just take the top x related items and throw the similarity score away?

As for the performance: yes, sorry, that was a little bragging and not really informative :)

On Mon, Nov 5, 2012 at 11:38 PM, Ted Dunning <[email protected]> wrote:

> On Mon, Nov 5, 2012 at 12:06 PM, Johannes Schulte <[email protected]> wrote:
>
> > do you really mean payloads? Because i consider them part of the index as
> > they are stored per position and can be accessed during scoring.
>
> I had the impression that they were not indexed. They are definitely
> available if you pull the document, but for high speed scoring, you should
> not do that if you possibly can avoid it.
>
> > How would you then incorporate the similarities in an index. With a faked
> > term frequency?
>
> You don't actually need to fake the term frequency. You can do that if you
> really want to adjust the weightings, but the native scoring in most
> retrieval engines is close enough to what you want that the benefits of
> coherent integration of multiple kinds of data over-powers the defects
> introduced (and it isn't clear that they actually are defects).
>
> > I always felt that payloads are a very natural and fast way of storing big
> > item-to-item relationships with additional content. You dont have to load
> > everything into memory or use something like a database like you have to do
> > with the current Mahout DataModel.
>
> I agree that databases are disasters for this.
>
> But can you access the payload without cracking open the document store?
>
> > Instead you have the caching goodness of
> > the lucene mmap directories without having to worry about heap. At least
> > we're encountering sub miliseconds response time this way...
>
> This is impressive. At what scale?
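
To make the question about keeping the similarity score concrete, here is a minimal sketch in Python of what "score as payload" could look like on the indexing side. It assumes the related items are written as whitespace-separated tokens of the form item|score, which is the kind of input a delimited-payload token filter expects, and that each score is packed into 4 bytes as a big-endian float. The function names, the delimiter, and the sample item ids are made up for illustration and are not taken from Lucene or Mahout.

```python
import struct


def encode_payload(score: float) -> bytes:
    """Pack a similarity score into 4 bytes (big-endian IEEE 754 float).

    Assumption: a 4-byte float is precise enough for an LLR-style score.
    """
    return struct.pack(">f", score)


def decode_payload(payload: bytes) -> float:
    """Inverse of encode_payload: read the score back at scoring time."""
    return struct.unpack(">f", payload)[0]


def related_items_field(similarities: dict[str, float], top_x: int,
                        delimiter: str = "|") -> str:
    """Build the text of a hypothetical 'related items' field.

    Keeps only the top_x most similar items, but instead of throwing the
    score away it is attached to each token as item<delimiter>score.
    """
    top = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)[:top_x]
    return " ".join(f"{item}{delimiter}{score}" for item, score in top)


if __name__ == "__main__":
    # Hypothetical LLR scores of three items related to one source item.
    llr = {"item17": 0.5, "item42": 2.0, "item99": 1.25}
    print(related_items_field(llr, top_x=2))
    print(decode_payload(encode_payload(1.25)))
```

The alternative discussed in the thread is to index only the item ids of the top x related items and let the engine's native scoring do the weighting, in which case the score part would simply be dropped.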
