It's a bit of a privacy-through-obscurity measure, unfortunately. The problem is that American courts do a lousy job of removing social security numbers from the cases that I put on my site. I do anonymization before sending the cases to Solr, but if you were clever (and the stopwords weren't in place) you could search for evidence of my anonymization efforts and then backtrack to the original cases at the court sites, where you'd find the SSNs...

It's a boondoggle, but the stopwords should help.

Mike



On Mon 09 Jan 2012 04:30:22 AM PST, Erik Hatcher wrote:
Mike -

Indeed users won't be able to *search* for things removed by the stop filter at 
index time (the terms literally aren't in the index then).  But be careful with 
the stored value.  Analysis does not affect stored content.
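
To illustrate the indexed-versus-stored distinction, here is a toy sketch in Python. It is not Solr code: `ToyIndex`, `analyze`, and `STOPWORDS` are hypothetical names, and the whitespace tokenizer is an assumption (a real analyzer chain may split a token such as xxx-xx-xxxx differently). It only shows that a stop filter keeps a term out of the searchable index while the stored text is left untouched.

```python
# Toy model of an index with a stop filter. All names here are hypothetical.
STOPWORDS = {"xxx-xx-xxxx"}


def analyze(text):
    """Whitespace-tokenize, lowercase, and drop stopwords (assumed chain)."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


class ToyIndex:
    def __init__(self):
        self.postings = {}  # term -> set of doc ids (the searchable part)
        self.stored = {}    # doc id -> original text (returned to clients)

    def add(self, doc_id, text):
        # The stored value is kept verbatim; analysis never touches it.
        self.stored[doc_id] = text
        for term in analyze(text):
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, query):
        # The same analysis runs at query time, so stopwords vanish here too.
        terms = analyze(query)
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)
```

With this sketch, a search for the stopword finds nothing, yet the stored text of any document that matches some other query still contains it.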

Are you anonymizing before sending to Solr? (If so, why the stop-word block?) If 
not, and you're storing that content, it could be returned to the searching 
client. And if you aren't anonymizing before sending to Solr, how are you using 
the stop word filtering to do this?

        Erik

On Jan 8, 2012, at 23:08 , Michael Lissner wrote:

I've got them configured at index and query time, so sounds like I'm all set.

I'm doing anonymization of social security numbers, converting them to 
xxx-xx-xxxx. I don't *think* users can find a way of identifying these docs if 
the stopwords-based block works.
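
For reference, the kind of conversion described above could look roughly like this in Python. This is only a sketch, not the actual code used on the site: the function name `anonymize_ssns` and the exact pattern (three digits, two digits, four digits, hyphen-separated) are assumptions, and SSNs written without hyphens or with other separators would not be caught.

```python
import re

# Assumed SSN shape: 3 digits, 2 digits, 4 digits, separated by hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PLACEHOLDER = "xxx-xx-xxxx"


def anonymize_ssns(text: str) -> str:
    """Replace every hyphenated SSN-shaped number with the placeholder."""
    return SSN_PATTERN.sub(PLACEHOLDER, text)
```

The placeholder is then the token that the stopwords list is meant to keep out of the index.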

Thank you both for the confirmation.

Mike

On Sun 08 Jan 2012 09:32:53 PM PST, Gora Mohanty wrote:
On Mon, Jan 9, 2012 at 5:03 AM, Michael Lissner
<mliss...@michaeljaylissner.com> wrote:
I have a unique use case where I have words in my corpus that users
shouldn't ever be allowed to search for. My theory is that if I add these to
the stopwords list, that should do the trick.

Yes, that should work. Are you including the stop words at index-time,
query-time, or both? Normally, you should do both.

If done at the time of indexing, these terms will not even be in the
index, so I cannot think of any security issues.

Regards,
Gora
