Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Walter Underwood
I wouldn’t worry about performance with that setup. I just checked on a production system with 13 million docs in four shards, so 3+ million per shard. I searched on the most common term in the title field and got a response in 31 milliseconds. This was probably not cached, because the

Re: Re: Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
Yup. You're going to find Solr is WAY more efficient than you think when it comes to complex queries. On Wed, Oct 9, 2019 at 3:17 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote: > True...I guess another rub here is that we're using the edismax parser, so > all of our queries are

Re: Re: Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
True...I guess another rub here is that we're using the edismax parser, so all of our queries are inherently OR queries. So for a query like 'the ibm way', the search engine would have to: 1) retrieve a document list for: --> "ibm" (this list is probably 80% of the documents) --> "the"
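For reference, edismax's `mm` (minimum-should-match) parameter is the usual lever for exactly this concern: the query stays an OR, but some fraction of the clauses must match, which prunes the huge posting lists for terms like "the" without needing a stop list. A minimal sketch (field names and boosts are placeholders, not from the thread):

```
q=the ibm way
defType=edismax
qf=title^2 body
mm=2<75%
```

Here `mm=2<75%` means: for queries of one or two terms, all terms are required; for longer queries, 75% of them are. The exact expression is a tuning choice.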

Re: Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
If you have anything close to a decent server you won't notice it at all. I'm at about 21 million documents, the index varies between 450GB and 800GB depending on merges, with about 60k searches a day, and it stays sub-second non-stop, and this is on a single core/non-cloud environment On Wed, Oct 9, 2019 at

Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
Only in my "more like this" tools, but they have a very specific purpose, otherwise no On Wed, Oct 9, 2019 at 2:31 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote: > Wow, thank you so much, everyone. This is all incredibly helpful insight. > > So, would it be fair to say that the majority

Re: Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
Oh, and by 'non stop' I mean close enough for me :) On Wed, Oct 9, 2019 at 2:59 PM David Hastings wrote: > if you have anything close to a decent server you wont notice it all. im > at about 21 million documents, index varies between 450gb to 800gb > depending on merges, and about 60k searches

Re: Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Also, in terms of computational cost, it would seem that including most terms/not having a stop list would take a toll on the system. For instance, right now we have "ibm" as a stop word because it appears everywhere in our corpus. If we did not include it in the stop words file, we would have
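The setup being described maps to a standard Solr analysis chain with a `StopFilterFactory` pointing at a stop-word file. A hedged sketch (the field type name is illustrative, not from the thread):

```
<!-- schema.xml excerpt -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stopwords.txt would contain lines like: the, a, ibm -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Dropping "ibm" from `stopwords.txt` grows the index by one very long posting list for that term, which is the cost being weighed here.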

Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
Yeah, I don't use it as a search, only, well, for finding more documents like that one :). For my purposes I tested between 2- and 5-part shingles and found that the 2-part shingle was actually giving me better results, for my use case, than using any more. I don't suppose you could point me to any of the

Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Walter Underwood
We did something like that with Infoseek and Ultraseek. We had a set of “glue words” that made noun phrases and indexed patterns like “noun glue noun” as single tokens. I remember Doug Cutting saying that Nutch did something similar using pairs, but using that as a prefilter instead of as a

Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Wow, thank you so much, everyone. This is all incredibly helpful insight. So, would it be fair to say that the majority of you all do NOT use stop words? -- Audrey Lorberfeld Data Scientist, w3 Search IBM audrey.lorberf...@ibm.com On 10/9/19, 11:14 AM, "David Hastings" wrote: However,

Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
However, with all that said, stopwords CAN be useful in some situations. I combine stopwords with the shingle factory to create "interesting phrases" (not really) that I use for my "more like this" needs. For example, "europe for vacation" and "europe on vacation" will both create the shingle europe_vacation
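The stopword-plus-shingle combination described here corresponds to chaining `StopFilterFactory` into `ShingleFilterFactory`. A plausible sketch (parameter values are assumptions; exact filler-token handling at stopword gaps varies across Lucene versions, so verify with the Analysis screen):

```
<fieldType name="text_shingles" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- "for"/"on" removed here, leaving europe + vacation adjacent-ish -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2"
            tokenSeparator="_" fillerToken="" outputUnigrams="false"/>
  </analyzer>
</fieldType>
```

With both "for" and "on" stopped out, "europe for vacation" and "europe on vacation" can collapse to the same europe_vacation shingle, which is exactly why the two phrasings become interchangeable for "more like this" matching.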

Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
another add on, as the previous two were pretty much spot on:

Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Erick Erickson
The theory behind stopwords is that they are “safe” to remove when calculating relevance, so we can squeeze every last bit of usefulness out of very constrained hardware (think 64K of memory. Yes kilobytes). We’ve come a long way since then and the necessity of removing stopwords from the

Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Alexandre Rafalovitch
Stopwords (this was discussed on the mailing list several times, I recall): the idea is that stopword removal used to be one of the tricks to make the index as small as possible to allow faster search, stopwords being the most common words. These days, disk space is not an issue most of the time and there have

Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Walter Underwood
Stopwords were used when we were running search engines on 16-bit computers with 50 Megabyte disks, like the PDP-11. They avoided storing and processing long posting lists. Think of removing stopwords as a binary weighting on frequent terms, either on or off (not in the index). With idf, we
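The "binary weighting vs. idf" point can be made concrete. In Lucene's classic TF-IDF similarity (BM25 is the default in recent versions, but the intuition is the same), the inverse document frequency of a term $t$ in a collection of $N$ documents is

```latex
\mathrm{idf}(t) = 1 + \ln\frac{N}{\mathrm{df}_t + 1}
```

so a term like "the" with $\mathrm{df}_t \approx N$ gets a weight near 1 (almost no influence), while removing it as a stopword is the cruder step function: weight exactly 0 and nothing in the index at all. idf achieves the down-weighting continuously, which is the argument against stop lists on modern hardware.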

Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hey Alex, Thank you! Re: stopwords being a thing of the past due to the affordability of hardware...can you expand? I'm not sure I understand. -- Audrey Lorberfeld Data Scientist, w3 Search IBM audrey.lorberf...@ibm.com On 10/8/19, 1:01 PM, "David Hastings" wrote: Another thing to

Re: Protecting Tokens from Any Analysis

2019-10-08 Thread David Hastings
Another thing to add to the above, > > IT:ibm. In this case, we would want to maintain the colon and the > capitalization (otherwise “it” would be taken out as a stopword). > Stopwords are a thing of the past at this point. There is no benefit to using them now with hardware being so cheap. On

Re: Protecting Tokens from Any Analysis

2019-10-08 Thread Alexandre Rafalovitch
If you don't want it to be touched by a tokenizer, how would the protection step know that the sequence of characters you want to protect is "IT:ibm" and not "this is an IT:ibm term I want to protect"? What it sounds like to me is that you may want to: 1) copyField to a second field 2) Apply a much
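The copyField-to-a-lighter-analyzer suggestion can be sketched as follows (field and type names are hypothetical; a whitespace tokenizer with no lowercasing or stop filtering leaves "IT:ibm" intact as a single token, colon and capitalization included):

```
<!-- schema.xml sketch -->
<field name="title"       type="text_en"       indexed="true" stored="true"/>
<field name="title_exact" type="text_verbatim" indexed="true" stored="false"/>
<copyField source="title" dest="title_exact"/>

<fieldType name="text_verbatim" class="solr.TextField">
  <analyzer>
    <!-- splits on whitespace only: no lowercasing, no stopwords -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Queries could then search both fields (e.g. edismax `qf=title title_exact^2`) so an exact "IT:ibm" hit on the verbatim field outranks the loosely analyzed match.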