Re: Identify stopwords using TF-IDF

2019-06-22 Thread Walter Underwood
I haven’t removed stopwords since 1996, when I joined Infoseek. What is your special case where you must remove them? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog)

Re: Identify stopwords using TF-IDF

2019-06-22 Thread akash jayaweera
Hello Walter, Thank you for the reply. But for some of my use cases I need to identify stopwords, so I need a better way to identify domain-specific stopwords. I used TF-IDF to identify them, but it has the issue I mentioned above. Regards, *Akash Jayaweera.* E akash.jayawe...@gmail.com M

Re: Identify stopwords using TF-IDF

2019-06-22 Thread Walter Underwood
Don’t remove stopwords. That was a useful hack when we were running search engines on 16-bit machines. These days, it causes more problems than it solves. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog)

Identify stopwords using TF-IDF

2019-06-22 Thread akash jayaweera
Hello All, I'm trying to identify stopwords for a non-English corpus using TF-IDF scores. I calculated the score for each unique term in the corpus. But my question is how I can select stopwords using the score. For example, if we have a corpus about football, the term "football" gets the lowest TF-IDF
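
The problem described in this message can be reproduced with a small sketch. It is not code from the thread: the toy corpus, the function names `idf_scores` and `candidate_stopwords`, the whitespace tokenizer, and the `max_idf` cutoff are all assumptions made for illustration. It uses the plain formula idf = log(N / df) and flags low-IDF terms as stopword candidates, which shows why a topical term such as "football" ends up next to genuine function words in a single-topic corpus.

```python
import math
from collections import Counter


def idf_scores(docs):
    """Return {term: idf} with idf = log(N / df).

    df is the number of documents that contain the term at least once.
    Tokenization is a naive lowercase whitespace split (an assumption).
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        # Use a set so each document counts a term only once.
        df.update(set(doc.lower().split()))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def candidate_stopwords(docs, max_idf=0.3):
    """Return terms whose IDF is at or below max_idf, sorted alphabetically.

    max_idf is a hypothetical cutoff; a real corpus would need tuning.
    """
    scores = idf_scores(docs)
    return sorted(term for term, score in scores.items() if score <= max_idf)


# Hypothetical single-topic corpus: every document is about football.
corpus = [
    "the football match was the best of the season",
    "the coach of the football club resigned",
    "fans of the football team filled the stadium",
    "the football league changed the transfer rules",
]

print(candidate_stopwords(corpus))
```

In this toy corpus "football" appears in every document, so its IDF is exactly zero, the same as "the". A score threshold alone cannot separate the two, which appears to be the issue raised here; any fix (for example, comparing against a general-language reference corpus) would be a separate design decision.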

Per-shard Backup/restore

2019-06-22 Thread Mikhail Khludnev
Hello, Do you think backing up and restoring separate shards of collections with implicit routing might be useful? I suppose it might work for certain multitenancy scenarios: when many small indices are created once but might then go unused for a long time. -- Sincerely yours Mikhail Khludnev

Re: ContentStreamUpdateRequest no longer closes stream

2019-06-22 Thread Mikhail Khludnev
FWIW, fixed in 8.2. Thanks, Colvin! On Wed, Jun 12, 2019 at 5:30 PM Colvin Cowie wrote: > I realize that attachments might not work on the mailing list, so here is > the test case on Drive > > https://drive.google.com/file/d/0B7mypFpwbHptTE5nZE0weURFOExFSHphRFlUV0EyTElaOC0w/view?usp=sharing > >

Re: Is Solr can do that ?

2019-06-22 Thread Toke Eskildsen
Matheo Software Info wrote: > My question is very simple ☺ I would like to know if Solr can process > around 30To of data (Pdf, Text, Word, etc…) ? Simple answer: Yes. Assuming 30To means 30 terabytes. > What is the best way to index this huge data ? several servers ? > several shards ? other ?