On 3/13/2018 7:24 AM, PeterKerk wrote:
> But without a StopFilter, won't stopwords be included in searches? I thought
> that for example Google excluded these words in their algorithms?
I just did a Google search for "to be or not to be". It worked flawlessly.
If Google were using stopwords, that search would have returned
nothing. The four words in that search are among the most frequent
words found in English prose. This is a typical stopword list for English:
To explain why the frequent responders on this list recommend not using
stopwords, and why the biggest search engine on the planet doesn't use
them, you need a small history lesson -- you have to know why stopword
filters were invented in the first place.
A search engine works by creating an inverted index. For a typical
full-text index, this means there is a big list of words, and for each
of those words, there is a list that identifies the document, field
name, and text offset where that word is found. Without a stopword
filter, the biggest entry in an index for English is probably "the" ...
in a corpus of a few million documents, "the" might appear *billions* of
times. So the list is BIG. And when the search has to deal with a big
entry in the inverted index, it's slower than normal.
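
To make the structure concrete, here is a minimal sketch of an inverted
index in Python. It is a toy, not how Lucene actually stores things: the
function name, the sample documents, and the use of a word position in
place of a real text offset are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to a list of (doc_id, field, position) postings.

    `docs` is a hypothetical structure: {doc_id: {field_name: text}}.
    The position is the word's index within the field, standing in
    for a real text offset.
    """
    index = defaultdict(list)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for position, word in enumerate(text.lower().split()):
                index[word].append((doc_id, field, position))
    return index


# Two made-up documents, just enough to show the effect.
docs = {
    1: {"title": "The Who", "body": "the band played the hits"},
    2: {"title": "Hamlet", "body": "to be or not to be that is the question"},
}

index = build_inverted_index(docs)

# Print words ordered by how many postings each one has, largest first.
for word in sorted(index, key=lambda w: len(index[w]), reverse=True):
    print(word, len(index[word]))
```

Even with two tiny documents, "the" ends up with the longest list. Scale
that up to millions of documents and you get the giant entry described
above.
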
Back in the annals of history (80s, 90s, etc) servers didn't have nearly
as much memory and CPU resources as they do now. Eliminating these
giant entries in the index made a HUGE difference in search
performance. A search that might take several seconds with the
stopwords included could be sped up to less than one second without them.
Even back then, the people who built stopword filters KNEW that they
were impacting search results. The reason they implemented them anyway
was to greatly improve search *performance*. They knew that a search
for "to be or not to be" or "the who" or any number of other similar
searches wouldn't work properly. But the vast majority of searches were
not really affected by the stopword removal, and users got their results
quickly.
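
A rough sketch of what a stopword filter does to those queries, again in
Python. The STOPWORDS set below is a small hypothetical subset chosen for
the example, not the actual list shipped with any search engine, and
analyze() is an invented helper, not a Solr or Lucene API.

```python
# Hypothetical, abbreviated stopword list for illustration only.
STOPWORDS = frozenset(
    {"a", "an", "and", "be", "is", "not", "of", "or", "the", "to"}
)


def analyze(text, stopwords=frozenset()):
    """Lowercase, split on whitespace, and drop any stopwords."""
    return [word for word in text.lower().split() if word not in stopwords]


queries = ["to be or not to be", "the who", "solr stopword filter"]

for query in queries:
    kept = analyze(query, STOPWORDS)
    print(repr(query), "->", kept)
```

With this list, the first query loses every term, so there is nothing
left to search for. The second is reduced to just "who". The third is
untouched, which is the "vast majority of searches" case.
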
Today, with modern hardware, search engines are much less bothered by
having enormous entries in the inverted index. When stopwords are NOT
removed, you can get more accurate search results. Yes, the index is
substantially bigger. But modern hardware is easy to load up with a lot
of disk space, memory, and CPU capacity, and search with stopwords is