I haven’t removed stopwords since 1996, when I joined Infoseek. What is your 
special case where you must remove them?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 22, 2019, at 9:51 PM, akash jayaweera <akash.jayawe...@gmail.com> 
> wrote:
> 
> Hello Walter,
> 
> Thank you for the reply.
> But for some of my use cases I need to identify stopwords, so I need a
> better way to identify domain-specific stopwords. I used TF-IDF to identify
> stopwords, but it has the issue I mentioned above.
> 
> Regards,
> *Akash Jayaweera.*
> 
> 
> E akash.jayawe...@gmail.com
> M + 94 77 2472635
> 
> 
> On Sun, Jun 23, 2019 at 10:13 AM Walter Underwood <wun...@wunderwood.org>
> wrote:
> 
>> Don’t remove stopwords. That was a useful hack when we were running search
>> engines on 16-bit machines. These days, it causes more problems than it
>> solves.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Jun 22, 2019, at 8:14 PM, akash jayaweera <akash.jayawe...@gmail.com> wrote:
>>> 
>>> Hello All,
>>> I'm trying to identify stopwords for a non-English corpus using TF-IDF
>>> scores. I calculated the score for each unique term in the corpus, but my
>>> question is how to select stopwords using those scores.
>>> For example, in a corpus about football, the term "football" gets the
>>> lowest TF-IDF score. But for my requirement I don't want to identify
>>> "football" as a stopword.
>>> How can I clearly identify stopwords? Is there any other simple method for
>>> identifying stopwords besides the TF-IDF score?
>>> 
>>> Regards,
>>> *Akash Jayaweera.*
>>> 
>>> 
>>> E akash.jayawe...@gmail.com
>>> M + 94 77 2472635
>> 
>> 
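
The issue raised in the original question can be reproduced with a small sketch. It assumes a plain whitespace tokenizer and the textbook definition idf = log(N / df); the toy corpus and the function name `inverse_document_frequency` are hypothetical and do not come from the thread. In a corpus where every document is about football, the domain term ends up tied with a function word at the bottom of the ranking, so a score threshold alone cannot tell them apart.

```python
import math
from collections import Counter


def inverse_document_frequency(docs):
    """Return {term: idf}, where idf = log(N / df) and df is the number
    of documents that contain the term at least once."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        # Count each term once per document (whitespace tokenizer, an assumption).
        df.update(set(doc.lower().split()))
    return {term: math.log(n_docs / count) for term, count in df.items()}


# Hypothetical toy corpus: every document is about football.
docs = [
    "the football match was in the city",
    "the football coach praised the team",
    "the football season starts in august",
]

idf = inverse_document_frequency(docs)
lowest_score = min(idf.values())
lowest = sorted(term for term, score in idf.items() if score == lowest_score)
print(lowest)  # ['football', 'the']
```

Both "the" and "football" occur in all three documents, so both get idf = log(3/3) = 0. Any term weight built on this idf treats them the same way, which is the behaviour described in the question.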
