Re: Re: Re: Re: Re: Protecting Tokens from Any Analysis
yup. you're going to find solr is WAY more efficient than you think when it
comes to complex queries.

On Wed, Oct 9, 2019 at 3:17 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote:

> True...I guess another rub here is that we're using the edismax parser, so
> all of our queries are inherently OR queries. So for a query like 'the ibm
> way', the search engine would have to:
>
> 1) retrieve a document list for:
>    --> "ibm" (this list is probably 80% of the documents)
>    --> "the" (this list is 100% of the English documents)
>    --> "way"
> 2) apply edismax parser
>    --> foreach term
>    --> --> foreach document in term
>    --> --> --> score it
>
> So, it seems like it would take a toll on our system but maybe that's
> incorrect! (For reference, our corpus is ~5MM documents, multi-language,
> and we get ~80k-100k queries/day)
>
> Are you using edismax?
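For a sense of scale, the ~80k-100k queries/day figure quoted above can be turned into an average rate. This is only back-of-the-envelope arithmetic: it assumes traffic is spread evenly over the day, which real traffic is not, so peak rates will be higher. The function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_qps(queries_per_day: float) -> float:
    """Average queries per second, assuming an even spread over 24 hours."""
    return queries_per_day / SECONDS_PER_DAY


low = average_qps(80_000)
high = average_qps(100_000)
print(f"{low:.2f} to {high:.2f} queries/second on average")
```

On these assumptions the average works out to roughly one query per second, which is consistent with the claim that a decent server should cope. It says nothing about peak load or per-query latency.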
Re: Re: Re: Re: Re: Protecting Tokens from Any Analysis
True...I guess another rub here is that we're using the edismax parser, so
all of our queries are inherently OR queries. So for a query like 'the ibm
way', the search engine would have to:

1) retrieve a document list for:
   --> "ibm" (this list is probably 80% of the documents)
   --> "the" (this list is 100% of the English documents)
   --> "way"
2) apply edismax parser
   --> foreach term
   --> --> foreach document in term
   --> --> --> score it

So, it seems like it would take a toll on our system but maybe that's
incorrect! (For reference, our corpus is ~5MM documents, multi-language,
and we get ~80k-100k queries/day)

Are you using edismax?

--
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com

On 10/9/19, 3:11 PM, "David Hastings" wrote:

> if you have anything close to a decent server you won't notice it at all.
> I'm at about 21 million documents, index varies between 450gb to 800gb
> depending on merges, and about 60k searches a day and stay sub second non
> stop, and this is on a single core/non cloud environment
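The retrieve-then-score steps listed above for 'the ibm way' can be sketched with a toy inverted index. This is a simplified illustration, not how Solr or Lucene actually execute an edismax query (their scorers are considerably more optimized). The function names, the sample documents, and the "count of matched terms" score are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def or_query(index, query):
    """Score every document that matches at least one query term.

    The score is just the number of distinct query terms matched,
    a stand-in for a real similarity function.
    """
    scores = defaultdict(int)
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):  # walk this term's posting list
            scores[doc_id] += 1
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {
    1: "the ibm way",
    2: "the history of ibm",
    3: "a way forward",
}
index = build_index(docs)
results = or_query(index, "the ibm way")
print(results)
```

In this sketch the work grows with the combined length of the posting lists for the query terms, which is the cost being asked about: a term that appears in nearly every document contributes a posting list nearly as long as the corpus.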
Re: Re: Re: Re: Protecting Tokens from Any Analysis
if you have anything close to a decent server you won't notice it at all.
I'm at about 21 million documents, index varies between 450gb to 800gb
depending on merges, and about 60k searches a day and stay sub second non
stop, and this is on a single core/non cloud environment

On Wed, Oct 9, 2019 at 2:55 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote:

> Also, in terms of computational cost, it would seem that including most
> terms/not having a stop list would take a toll on the system. For instance,
> right now we have "ibm" as a stop word because it appears everywhere in our
> corpus. If we did not include it in the stop words file, we would have to
> retrieve every single document in our corpus and rank them. That's a high
> computational cost, no?
Re: Re: Re: Re: Protecting Tokens from Any Analysis
oh and by 'non stop' I mean close enough for me :)

On Wed, Oct 9, 2019 at 2:59 PM David Hastings wrote:

> if you have anything close to a decent server you won't notice it at all.
> I'm at about 21 million documents, index varies between 450gb to 800gb
> depending on merges, and about 60k searches a day and stay sub second non
> stop, and this is on a single core/non cloud environment
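The stop-word-plus-shingle trick David describes in this thread ("europe for vacation" and "europe on vacation" both yielding europe_vacation) can be sketched in a few lines. This is plain Python rather than Solr's analysis chain, and the stop list, function name, and underscore separator are assumptions chosen to match the example in the thread. Solr's own filters may treat the position gaps left by removed stop words differently, so it is worth checking the real output on the analysis screen.

```python
STOP_WORDS = frozenset({"for", "on"})  # hypothetical stop list, illustration only


def shingles(text, stop_words=frozenset(), size=2, sep="_"):
    """Return word n-grams (shingles) after dropping any stop words."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return [sep.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# Stop words removed first: both phrases collapse to the same shingle.
print(shingles("europe for vacation", STOP_WORDS))
print(shingles("europe on vacation", STOP_WORDS))

# Stop words left in: the phrases no longer share any shingle.
print(shingles("europe for vacation"))
print(shingles("europe on vacation"))
```

Because both phrases reduce to the same shingle once the stop words are gone, documents containing either phrasing can be related through that one token.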
Re: Re: Re: Re: Protecting Tokens from Any Analysis
Also, in terms of computational cost, it would seem that including most
terms/not having a stop list would take a toll on the system. For instance,
right now we have "ibm" as a stop word because it appears everywhere in our
corpus. If we did not include it in the stop words file, we would have to
retrieve every single document in our corpus and rank them. That's a high
computational cost, no?

--
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com

On 10/9/19, 2:31 PM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com" wrote:

> Wow, thank you so much, everyone. This is all incredibly helpful insight.
>
> So, would it be fair to say that the majority of you all do NOT use
> stop words?
>
> On 10/9/19, 11:14 AM, "David Hastings" wrote:
>
>> However, with all that said, stopwords CAN be useful in some situations.
>> I combine stopwords with the shingle factory to create "interesting
>> phrases" (not really) that I use in "my more like this" needs. for example,
>>
>>   europe for vacation
>>   europe on vacation
>>
>> will create the shingle
>>
>>   europe_vacation
>>
>> which I can then use to relate other documents that would be much
>> more similar in such regard, rather than just using the "interesting words"
>>
>>   europe, vacation
>>
>> with stop words, the shingles would be
>>
>>   europe_for
>>   for_vacation
>>
>> and
>>
>>   europe_on
>>   on_vacation
>>
>> just something to keep in mind, there's a lot of creative ways to use
>> stopwords depending on your needs. I use the above for a VERY basic ML
>> teacher and it works way better than using stopwords,
>>
>> On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson wrote:
>>
>>> The theory behind stopwords is that they are “safe” to remove when
>>> calculating relevance, so we can squeeze every last bit of usefulness out
>>> of very constrained hardware (think 64K of memory. Yes kilobytes). We’ve
>>> come a long way since then and the necessity of removing stopwords from the
>>> indexed tokens to conserve RAM and disk is much less relevant than it used
>>> to be in “the bad old days” when the idea of stopwords was invented.
>>>
>>> I’m not quite so confident as Alex that there is “no benefit”, but I’ll
>>> totally agree that you should remove stopwords only _after_ you have some
>>> evidence that removing them is A Good Thing in your situation.
>>>
>>> And removing stopwords leads to some interesting corner cases. Consider a
>>> search for “to be or not to be” if they’re all stopwords.
>>>
>>> Best,
>>> Erick
>>>
>>>> On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
>>>> audrey.lorberf...@ibm.com wrote:
>>>>
>>>> Hey Alex,
>>>>
>>>> Thank you!
>>>>
>>>> Re: stopwords being a thing of the past due to the affordability of
>>>> hardware...can you expand? I'm not sure I understand.
>>>>
>>>> On 10/8/19, 1:01 PM, "David Hastings" wrote:
>>>>
>>>>> Another thing to add to the above,
>>>>>
>>>>>> IT:ibm. In this case, we would want to maintain the colon and the
>>>>>> capitalization (otherwise “it” would be taken out as a stopword).
>>>>>
>>>>> stopwords are a thing of the past at this point. there is no benefit
>>>>> to using them now with hardware being so cheap.
>>>>>
>>>>> On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <arafa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> If you don't want it to be touched by a tokenizer, how would the
>>>>>> protection step know that the sequence of characters you want to
>>>>>> protect is "IT:ibm" and not "this is an IT:ibm term I want to
>>>>>> protect"?
>>>>>>
>>>>>> What it sounds to me is that you may want to:
>>>>>> 1) copyField to a second field
>>>>>> 2) Apply a much lighter (whitespace?) tokenizer to that second field
>>>>>> 3) Run the results through something like KeepWordFilterFactory
>>>>>> 4) Search both fields with a boost on the second, higher-signal field
>>>>>>
>>>>>> The other option is to run CharacterFilter,
>>>>>> (PatternReplaceCharFilterFactory) which is pre-tokenizer to map known
>>>>>> complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
>>>>>> term365". As long as it is done on