pagination with searchAfter

2022-09-23 Thread erel
I’ve never used searchAfter before so looking for some tips and hints. I understand that I need to maintain a server side cache with the relevant ScoreDocs, right? The index is refreshed every couple of minutes. How will that affect the cached ScoreDocs? I don’t mind too much having some inco
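The idea behind search-after pagination can be sketched without Lucene: instead of caching a whole result list, the caller keeps only the last hit of the previous page and asks for the hits that sort after it. The Python below is a toy model under assumptions, not Lucene code. `Hit`, `search_after`, and the sample scores are hypothetical, and the ordering (score descending, ties broken by ascending doc id) is an assumption made for the sketch. In Lucene itself the corresponding call is `IndexSearcher.searchAfter(ScoreDoc after, Query query, int n)`.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    """Hypothetical stand-in for a (doc id, score) result entry."""
    doc: int
    score: float


def sort_key(hit: Hit):
    # Higher scores first; ties broken by ascending doc id (assumed ordering).
    return (-hit.score, hit.doc)


def search_after(hits: List[Hit], after: Optional[Hit], n: int) -> List[Hit]:
    """Return up to n hits that sort strictly after the cursor `after`."""
    ordered = sorted(hits, key=sort_key)
    if after is not None:
        ordered = [h for h in ordered if sort_key(h) > sort_key(after)]
    return ordered[:n]


hits = [Hit(0, 1.5), Hit(1, 3.0), Hit(2, 1.5), Hit(3, 0.5), Hit(4, 2.0)]

page1 = search_after(hits, None, 2)        # first page: no cursor yet
page2 = search_after(hits, page1[-1], 2)   # cursor = last hit of page 1
page3 = search_after(hits, page2[-1], 2)   # cursor = last hit of page 2
```

In this toy model the cursor is a position in the sort order rather than an offset, so hits added or removed between calls shift later pages instead of invalidating the cursor. The sketch does not settle how a real index refresh affects a cached `ScoreDoc`.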

Re: Best practice - preparing search term for Lucene

2022-09-23 Thread Hrvoje Lončar
Good point! For now I'll leave it normalized. Every search term coming from the frontend is stored and its counter updated, which will help me see trends after some time and decide whether to change the logic. P.S. Here is the funny part: in Croatian "pišanje" means peeing while "pisanje" mea

Re: Best practice - preparing search term for Lucene

2022-09-23 Thread Stephane Passignat
Hi, I wouldn't store the original value. That's "just" an index. But do store your DB identifiers, because I think you'll want them at some point. (I built the same kind of feature on top of DataNucleus.) I usually have technical IDs in my DB, even more so since I started using JDO/JPA some 20

Re: Best practice - preparing search term for Lucene

2022-09-23 Thread Michael Sokolov
I think it depends on how precise you want to make the search. If you want to enable diacritic-sensitive search, to avoid confusion when users are actually able to enter the diacritics, you can index both ways (ASCII-folded and not folded) and not normalize the query terms. Or you can just fo
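One possible reading of the "index both ways" suggestion, sketched in plain Python rather than Lucene: each term is indexed in its original form and in a folded form, and the query term is looked up exactly as typed. The `fold` helper and the toy inverted index are hypothetical. `fold` only strips combining marks after Unicode decomposition, which is an assumption that covers letters such as "š" but not everything a full ASCII-folding filter handles (for example "đ").

```python
import unicodedata
from collections import defaultdict


def fold(text: str) -> str:
    """Strip combining marks after NFKD decomposition (simplified folding)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Two documents that differ only by a diacritic.
docs = {
    1: "pi\u0161anje",  # "pišanje"
    2: "pisanje",
}

# Toy inverted index: every term is indexed both unfolded and folded.
index = defaultdict(set)
for doc_id, term in docs.items():
    index[term].add(doc_id)
    index[fold(term)].add(doc_id)


def search(term: str):
    """Look the query term up as typed, without normalizing it."""
    return sorted(index.get(term, set()))


with_diacritic = search("pi\u0161anje")  # matches only the unfolded form
without_diacritic = search("pisanje")    # matches the folded form of both
```

A user who types the diacritic gets the precise match, while a user who types plain ASCII still finds both documents.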

Re: Max Field Length

2022-09-23 Thread Michael Sokolov
ooh
On Fri, Sep 23, 2022 at 11:02 AM Adrien Grand wrote:
> We have a TruncateTokenFilter in lucene/analysis/common. :)
> On Fri, Sep 23, 2022 at 4:39 PM Michael Sokolov wrote:
> > I wonder if it would make sense to provide a TruncationFilter in
> > addition to the LengthFilter. That way lo

Re: Max Field Length

2022-09-23 Thread Adrien Grand
We have a TruncateTokenFilter in lucene/analysis/common. :)
On Fri, Sep 23, 2022 at 4:39 PM Michael Sokolov wrote:
> I wonder if it would make sense to provide a TruncationFilter in
> addition to the LengthFilter. That way long tokens in source text
> could be better supported, albeit with some

Re: Max Field Length

2022-09-23 Thread Michael Sokolov
I wonder if it would make sense to provide a TruncationFilter in addition to the LengthFilter. That way long tokens in source text could be better supported, albeit with some confusion if they share the same very long prefix...
On Fri, Sep 23, 2022 at 9:56 AM Scott Guthery wrote:
> Thanks much,
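The trade-off between dropping and truncating over-long tokens can be illustrated with a small Python sketch. This is not Lucene code: `length_filter` and `truncate_filter` are hypothetical helpers that only mimic the behavior described in the thread, lengths are counted in characters, the limit of 8 is an arbitrary stand-in for a real maximum, and the sample tokens are made up.

```python
from typing import List


def length_filter(tokens: List[str], max_len: int) -> List[str]:
    """Drop every token longer than max_len (the token is lost entirely)."""
    return [t for t in tokens if len(t) <= max_len]


def truncate_filter(tokens: List[str], prefix_len: int) -> List[str]:
    """Keep every token, but cut each one down to its first prefix_len chars."""
    return [t[:prefix_len] for t in tokens]


# Two long tokens that differ only after their first 8 characters.
tokens = ["patent", "ACGTACGTAA", "ACGTACGTCC"]

kept = length_filter(tokens, 8)
truncated = truncate_filter(tokens, 8)

# After truncation the two long tokens become the same term.
collided = truncated[1] == truncated[2]
```

Dropping makes the long tokens unsearchable, while truncating keeps them searchable by prefix at the cost of treating tokens with the same long prefix as identical.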

Re: Max Field Length

2022-09-23 Thread Scott Guthery
Thanks much, Adrien. I hadn't realized that the size limit was on one token in the text as opposed to being a limit on the length of the entire text field. I'm loading patents, so I suspect that the very long word is a DNA sequence. Thanks also for your guidance with regard to setting maximums.

Re: Questions about Lucene source

2022-09-23 Thread Adrien Grand
On the 2nd question, we do not plan on leveraging this information to figure out the codec: the codec that should be used to read a segment is stored separately (also in segment infos). It is mostly useful for diagnostic purposes. E.g. if we see an interesting corruption case where checksums matc

Re: Max Field Length

2022-09-23 Thread Adrien Grand
Hi Scott, There is no way to lift this limit. The assumption is that a user would never type a 32kB keyword in a search bar, so indexing such long keywords is wasteful. Some tokenizers, like StandardTokenizer, can be configured to limit the length of the tokens that they produce; there is also a Len

Re: Best practice - preparing search term for Lucene

2022-09-23 Thread Hrvoje Lončar
Hi Stephane! Actually, I have exactly that kind of conversion, but I didn't mention it as my mail was long enough without it :) My main concern is whether I should let Lucene index the original keywords or not. Considering what you wrote, I guess your answer would be to store only converted values without exot