On 3/13/2018 7:24 AM, PeterKerk wrote:
PVK COMMENT:
But without a StopFilter, won't stopwords be included in searches? I thought
that, for example, Google excluded these words in their algorithms?

I just did a Google search for "to be or not to be".  It worked flawlessly.

If Google were removing stopwords, that search would have returned nothing.  All four distinct words in that search (to, be, or, not) are among the most frequent words found in English prose, and every one of them is on a typical stopword list.  This is a typical stopword list for English:

a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
such
that
the
their
then
there
these
they
this
to
was
will
with

To explain why the frequent responders on this mailing list recommend not using a stopword filter, and why the biggest search engine on the planet doesn't use one, you need a small history lesson -- you have to know why stopword filters were invented in the first place.

A search engine works by creating an inverted index. For a typical full-text index, this means that there is a big list of words, and for each of those words, there is a list that identifies the document, field name, and text offset of every place where that word is found.  Without a stopword filter, the biggest entry in an index for English is probably "the" ... in a corpus of a few million documents, "the" might appear *billions* of times.  So the list is BIG.  And when a search has to deal with a big entry in the inverted index, it's slower than normal.
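
Here is a minimal sketch of that structure in Python, just to make the idea concrete. It is not how Lucene or Solr actually store an index; the function name, the sample documents, and the whitespace tokenizer are all made up for illustration, and it records word positions rather than character offsets.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to a list of (doc_id, field, position) postings.

    Hypothetical helper for illustration only; real engines use far more
    compact on-disk structures and a proper analysis chain.
    """
    index = defaultdict(list)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            # Naive analysis: lowercase and split on whitespace.
            for position, word in enumerate(text.lower().split()):
                index[word].append((doc_id, field, position))
    return index

# Tiny made-up corpus: doc id -> {field name: text}
docs = {
    1: {"title": "the who", "body": "the band played the hits"},
    2: {"title": "hamlet", "body": "to be or not to be"},
}

index = build_inverted_index(docs)

# Count postings per word to see which entry is the biggest.
postings_per_word = {word: len(postings) for word, postings in index.items()}
largest = max(postings_per_word, key=postings_per_word.get)
print(largest, postings_per_word[largest])
```

Even in this toy corpus the entry for "the" is the longest one. Scale that up to millions of documents and you can see why that one entry gets so expensive to read.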

Back in the 1980s and 1990s, servers didn't have nearly as much memory and CPU resources as they do now.  Eliminating these giant entries in the index made a HUGE difference in search performance.  A search that might take several seconds with the stopwords included could be sped up to less than one second without them.

Even back then, the people who built stopword filters KNEW that they were impacting search results.  The reason they implemented them anyway was to greatly improve search *performance*.  They knew that a search for "to be or not to be" or "the who" or any number of other similar searches wouldn't work properly.  But the vast majority of searches were not really affected by the stopword removal, and users got their results really fast.
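
To show what that damage looks like, here is a rough Python sketch of a stopword filter applied to a few queries. The `analyze` function and the whitespace tokenizer are assumptions for illustration, not Solr's StopFilter implementation; the word set is the typical English list shown earlier in this message.

```python
# The typical English stopword list from earlier in this message.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}

def analyze(text, remove_stopwords):
    """Hypothetical analysis chain: lowercase, split, optionally drop stopwords."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

for query in ("to be or not to be", "the who", "solr stopword filter"):
    print(repr(query), "->", analyze(query, remove_stopwords=True))
```

With the filter on, "to be or not to be" is reduced to no terms at all, and "the who" collapses to just "who", so neither search can match what the user meant. The third query passes through untouched, which is the case most searches fall into.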

Today, with modern hardware, search engines are much less bothered by having enormous entries in the inverted index.  When stopwords are NOT removed, you can get more accurate search results.  Yes, the index is substantially bigger.  But modern hardware is easy to load up with a lot of disk space, memory, and CPU capacity, and search with stopwords is fast enough.

Thanks,
Shawn
