Hello,
I'm trying to improve the speed of an index when searching for long phrases. I
performed some tests with the benchmark module. With a simple analyser and
PhraseQueries, I get a throughput of 118 rec/sec. My test dataset is the
latest dump of Wikipedia. Here are the filters I use at indexing time
In my experience, shingles can hurt query performance because the term
dictionary grows quite a bit. There are far more unique bigrams than there
are words. While the lookup time doesn't grow linearly with the number of
terms, it still grows.
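
To make the dictionary growth concrete, here is a minimal sketch in plain Python (not Lucene code) that counts distinct words versus distinct word bigrams. The three-sentence corpus and the unique_terms helper are made up for illustration; on a real corpus such as Wikipedia the gap is expected to be far larger.

```python
def unique_terms(docs, n):
    """Return the set of distinct word n-grams (shingles of size n) in docs."""
    terms = set()
    for doc in docs:
        words = doc.lower().split()
        # Slide a window of n words over the document.
        for i in range(len(words) - n + 1):
            terms.add(" ".join(words[i:i + n]))
    return terms


# Hypothetical toy corpus, only for illustration.
docs = [
    "the quick brown fox",
    "the lazy brown dog",
    "the quick dog and the lazy fox",
]

unigrams = unique_terms(docs, 1)  # what a plain word index would store
bigrams = unique_terms(docs, 2)   # what a 2-word shingle field would add

print("unique words:  ", len(unigrams))
print("unique bigrams:", len(bigrams))
```

Even on this tiny input the bigram set is larger than the word set, and every one of those bigrams becomes an extra entry in the term dictionary.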
I haven't specifically compared performance numbers
Be sure to check whether your app is compute-bound or I/O-bound during this
process, that is, whether too little of your index is cached in system memory
so that each query requires I/O, and lots of it.
-- Jack Krupansky
On Thu, Jan 21, 2016 at 1:52 PM, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:
Shingles should make a huge difference on phrase query performance if
1) the phrase queries involve high frequency terms and 2) you have a
substantial number of documents in the index (so that
time-to-visit-postings dominates over time-to-lookup-terms).
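
The trade-off between visiting postings and looking up terms can be sketched with a toy inverted index in plain Python (again not Lucene code). The corpus and the build_postings helper are hypothetical, and the sketch counts only document entries, whereas a real phrase query also has to read term positions.

```python
from collections import defaultdict


def build_postings(docs, n):
    """Map each word n-gram to the set of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        words = doc.lower().split()
        for i in range(len(words) - n + 1):
            postings[" ".join(words[i:i + n])].add(doc_id)
    return postings


# Hypothetical toy corpus where "to" and "be" are high-frequency terms.
docs = [
    "to be or not to be",
    "not to worry",
    "be kind to others",
    "what is to be done",
]

word_postings = build_postings(docs, 1)
shingle_postings = build_postings(docs, 2)

# Phrase "to be" without shingles: both single-word postings lists are visited.
visited_without = len(word_postings["to"]) + len(word_postings["be"])

# With shingles: one lookup of the bigram term, with a much shorter list.
visited_with = len(shingle_postings["to be"])

print("entries visited without shingles:", visited_without)
print("entries visited with shingles:   ", visited_with)
```

The saving grows with term frequency and document count, which matches the two conditions above: with rare terms or a small index, the postings lists are short either way and the extra dictionary size is all that remains.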
118 rec/sec is already very fast for a long
Thank you all for your answers. Initially, I also thought that shingles
should make a huge difference. I will give the CommonGramsFilter a try.
In the meantime, this additional information may help you identify
a problem in my setup.
Basically, I indexed the whole Wikipedia dump (> 8