It's a bit of a privacy-through-obscurity measure, unfortunately. The
problem is that American courts do a lousy job of removing social
security numbers from the cases they publish, which I then put on my
site. I do anonymization
before sending the cases to Solr, but if you're clever (and the
stopwords weren't in place) you could search for evidence of my
anonymization efforts and then backtrack to the original cases at the
court sites, where you'd find the SSNs...
It's a boondoggle, but the stopwords should help.
Mike
On Mon 09 Jan 2012 04:30:22 AM PST, Erik Hatcher wrote:
Mike -
Indeed users won't be able to *search* for things removed by the stop filter at
index time (the terms literally aren't in the index then). But be careful with
the stored value. Analysis does not affect stored content.
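
To make the index/stored distinction concrete, here is a minimal toy model in Python. It is not Solr's API; the `ToyIndex` class, the whitespace tokenizer, and the stop list holding the `xxx-xx-xxxx` placeholder are all assumptions made for illustration. The point it sketches is only that the stop filter controls which terms are searchable, while the stored text comes back unchanged.

```python
# Toy model of the indexed/stored split. This is NOT Solr's API.
# Assumption: the stop list holds the anonymization placeholder, and
# tokens are split on whitespace (a real Solr tokenizer may split differently).
STOPWORDS = {"xxx-xx-xxxx"}


def analyze(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


class ToyIndex:
    def __init__(self):
        self.postings = {}  # term -> set of doc ids (what can be searched)
        self.stored = {}    # doc id -> original text (what is returned)

    def add(self, doc_id, text):
        self.stored[doc_id] = text  # stored verbatim; analysis is not applied
        for term in analyze(text):
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, query):
        terms = analyze(query)  # same stop filter at query time
        if not terms:
            return []  # the query consisted only of stop words
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [self.stored[d] for d in sorted(hits)]


index = ToyIndex()
index.add(1, "Plaintiff SSN xxx-xx-xxxx redacted")

print(index.search("xxx-xx-xxxx"))  # [] : the placeholder is not searchable
print(index.search("plaintiff"))    # ['Plaintiff SSN xxx-xx-xxxx redacted']
```

In this sketch the placeholder cannot be searched for, but it is still visible in any stored text that is returned for another query.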
Are you anonymizing before sending to Solr? (If so, why the stop-word block?)
If not, and you're storing that content, it could be returned to the searching
client. And if you aren't anonymizing before sending to Solr, how are you using
the stop word filtering to do this?
Erik
On Jan 8, 2012, at 23:08 , Michael Lissner wrote:
I've got them configured at index and query time, so it sounds like I'm all set.
I'm doing anonymization of social security numbers, converting them to
xxx-xx-xxxx. I don't *think* users can find a way of identifying these docs if
the stopwords-based block works.
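
A minimal sketch of that kind of conversion, in Python. The regular expression, the `anonymize_ssns` name, and the assumption that SSNs appear as 3-2-4 digit groups separated by hyphens or single spaces are all illustrative; the actual anonymization code is not shown in this thread, and real court text may contain other formats.

```python
import re

# Assumption: an SSN is three digits, two digits, four digits, with a
# hyphen or a single space between groups. Other formats are not handled.
SSN_PATTERN = re.compile(r"\b\d{3}[- ]\d{2}[- ]\d{4}\b")
PLACEHOLDER = "xxx-xx-xxxx"


def anonymize_ssns(text: str) -> str:
    """Replace anything that looks like an SSN with the placeholder."""
    return SSN_PATTERN.sub(PLACEHOLDER, text)


print(anonymize_ssns("Defendant's SSN is 123-45-6789."))
# Defendant's SSN is xxx-xx-xxxx.
```

Running this before the text is sent to Solr keeps the number out of both the index and the stored field; the placeholder itself is then the only trace left in the document.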
Thank you both for the confirmation.
Mike
On Sun 08 Jan 2012 09:32:53 PM PST, Gora Mohanty wrote:
On Mon, Jan 9, 2012 at 5:03 AM, Michael Lissner
<mliss...@michaeljaylissner.com> wrote:
I have a unique use case where I have words in my corpus that users
shouldn't ever be allowed to search for. My theory is that if I add these to
the stopwords list, that should do the trick.
Yes, that should work. Are you including the stop words at index-time,
query-time, or both? Normally, you should do both.
If done at the time of indexing, these terms will not even be in the
index, so I cannot think of any security issues.
Regards,
Gora