Poor performances with Shingle and Phrase query

2016-01-21 Thread Bertil Chapuis
Hello, I'm trying to improve the speed of an index when searching for long phrases. I performed some tests with the benchmark module. With a simple analyser and PhraseQueries, I get a throughput of 118 rec/sec. My test dataset is the latest dump of Wikipedia. Here are the filters I use at indexing time

Re: Poor performances with Shingle and Phrase query

2016-01-21 Thread Doug Turnbull
In my experience, shingles can hurt query performance because the term dictionary grows quite a bit. There are far more unique bigrams than there are words. While the lookup time doesn't grow linearly with the number of terms, it still grows. I haven't specifically compared performance numbers
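To make the dictionary-growth point concrete, here is a minimal sketch in plain Python (not Lucene code) that counts unique words and unique two-word shingles in a tiny sample text. The `shingles` helper and the sample text are made up for illustration; a real analyzer chain would tokenize and filter differently.

```python
def shingles(tokens, size=2):
    """Return word n-grams (shingles) of the given size, joined by a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# Hypothetical sample text; whitespace splitting stands in for a real tokenizer.
text = (
    "the quick brown fox jumps over the lazy dog "
    "the lazy dog sleeps while the quick fox runs "
    "the brown dog jumps over the quick fox"
)
tokens = text.split()

# Each set plays the role of a term dictionary: one entry per distinct term.
unique_words = set(tokens)
unique_bigrams = set(shingles(tokens, 2))

print("unique words:  ", len(unique_words))
print("unique bigrams:", len(unique_bigrams))
```

Even on a text this small, the bigram dictionary comes out larger than the word dictionary, and the gap widens on a real corpus.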

Re: Poor performances with Shingle and Phrase query

2016-01-21 Thread Jack Krupansky
Be sure to check whether your app is compute-bound or I/O-bound during this process: if too little of your index is cached in system memory, each query requires I/O, and lots of it.

Re: Poor performances with Shingle and Phrase query

2016-01-21 Thread Michael McCandless
Shingles should make a huge difference in phrase query performance if 1) the phrase queries involve high-frequency terms and 2) you have a substantial number of documents in the index (so that time-to-visit-postings dominates over time-to-lookup-terms). 118 rec/sec is already very fast for a long
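The postings argument can be illustrated with a toy inverted index in plain Python. This is only a sketch under simplifying assumptions: `build_index` and the four sample documents are hypothetical, positions are ignored, and Lucene's actual phrase matching does considerably more work. The sketch only compares how many postings entries a two-word phrase touches with and without shingles.

```python
from collections import defaultdict


def build_index(docs, size):
    """Map each n-gram of `size` words to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        tokens = text.split()
        for i in range(len(tokens) - size + 1):
            index[" ".join(tokens[i:i + size])].add(doc_id)
    return index


# Hypothetical mini-corpus in which "of" and "the" are high-frequency terms.
docs = [
    "the history of the city",
    "the city of light",
    "a map of a city near the river",
    "the river runs to the sea",
]
unigrams = build_index(docs, 1)
bigrams = build_index(docs, 2)

phrase = "of the"
# Without shingles, the postings of every term in the phrase must be visited.
unigram_postings = sum(len(unigrams[term]) for term in phrase.split())
# With shingles, the phrase is a single term with its own, much shorter list.
bigram_postings = len(bigrams[phrase])

print("postings visited without shingles:", unigram_postings)
print("postings visited with shingles:   ", bigram_postings)
```

The saving grows with the frequency of the individual terms, which matches the two conditions stated above: common terms and a large index.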

Re: Poor performances with Shingle and Phrase query

2016-01-21 Thread Bertil Chapuis
Thank you all for your answers. Initially, I also thought that shingles should make a huge difference. I will give the CommonGramsFilter a try. In the meantime, this additional information may help you identify a problem in my setup. Basically, I indexed the whole Wikipedia dump (> 8