Re: Synonym configuration not working?

2012-01-14 Thread Michael Lissner
Just replying for others in the future. The answer to this is to do 
synonyms at index time, not at query time.
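
For anyone who finds this later, here is a toy model of why the two placements behave differently when every query term is required (defaultOperator set to AND). It is not Solr code: the expand and matches_all helpers, the sample documents, and the single-token synonym group are all made up for illustration, and a real analysis chain treats multi-word synonyms such as "united states" differently.

```python
# Toy model only: not Solr's analysis chain. All names here are hypothetical.
SYNONYM_GROUP = {"us", "usa", "united"}  # single tokens, to keep the sketch simple


def expand(terms):
    """Add every member of the synonym group when any member is present."""
    expanded = set(terms)
    if expanded & SYNONYM_GROUP:
        expanded |= SYNONYM_GROUP
    return expanded


def matches_all(query_terms, doc_terms):
    """AND semantics: every query term must occur in the document."""
    return all(term in doc_terms for term in query_terms)


docs = {
    1: "the usa filed a brief",
    2: "us and usa and united parties",
}

# Query-time expansion: the query grows to three required terms.
query_time_hits = [
    doc_id for doc_id, text in docs.items()
    if matches_all(expand({"us"}), set(text.split()))
]

# Index-time expansion: each document carries all synonyms, the query stays as typed.
index_time_hits = [
    doc_id for doc_id, text in docs.items()
    if matches_all({"us"}, expand(text.split()))
]

print(query_time_hits)  # [2]
print(index_time_hits)  # [1, 2]
```

With query-time expansion only the document containing all three terms matches; with index-time expansion a search for any one of them finds both.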


Mike

On Fri 06 Jan 2012 02:35:23 PM PST, Michael Lissner wrote:
I'm trying to set up some basic synonyms. The one I've been working on 
is:


us, usa, united states

My understanding is that adding that to the synonym file will allow 
users to search for US, and get back documents containing usa or 
united states. Ditto for if a user puts in usa or united states.


Unfortunately, with this in place, when I do a search, I get the 
results for items that contain all three of the words - it's doing an 
AND of the synonyms rather than an OR.


If I turn on debugging, this is indeed what I see (plus some stemming):
(+DisjunctionMaxQuery(((westCite:us westCite:usa westCite:unit) | 
(text:us text:usa text:unit) | (docketNumber:us docketNumber:usa 
docketNumber:unit) | ((status:us status:usa status:unit)^1.25) | 
(court:us court:usa court:unit) | (lexisCite:us lexisCite:usa 
lexisCite:unit) | ((caseNumber:us caseNumber:usa 
caseNumber:unit)^1.25) | ((caseName:us caseName:usa 
caseName:unit)^1.5/no_coord


Am I doing something wrong to cause this? My defaultOperator is set to 
AND, but I'd expect the synonym filter to understand that.


Any help?

Thanks,

Mike


Re: stopwords as privacy measure

2012-01-10 Thread Michael Lissner
It's a bit of a privacy through obscurity measure, unfortunately. The 
problem is that American courts do a lousy job of removing social 
security numbers from cases that I put on my site. I do anonymization 
before sending the cases to Solr, but if you're clever (and the 
stopwords weren't in place) you could search for evidence of my 
anonymization efforts and then backtrack to the original cases at the 
court sites, where you'd find the SSNs...


It's a boondoggle, but the stopwords should help.

Mike



On Mon 09 Jan 2012 04:30:22 AM PST, Erik Hatcher wrote:

Mike -

Indeed users won't be able to *search* for things removed by the stop filter at 
index time (the terms literally aren't in the index then).  But be careful with 
the stored value.  Analysis does not affect stored content.
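
A toy illustration of that distinction, assuming a made-up stopword ("secretword") and hypothetical add_document and search helpers. It is only a sketch of the idea that the stop filter affects what can be searched, not what is stored, and it does not mirror Solr internals.

```python
# Toy index: hypothetical names, not Solr internals.
STOPWORDS = {"secretword"}

inverted_index = {}  # term -> set of doc ids (what searches run against)
stored_fields = {}   # doc id -> original text (what results return)


def add_document(doc_id, text):
    stored_fields[doc_id] = text  # the stored value is kept verbatim
    for token in text.lower().split():
        if token in STOPWORDS:
            continue  # the stop filter drops the term before indexing
        inverted_index.setdefault(token, set()).add(doc_id)


def search(term):
    return sorted(inverted_index.get(term.lower(), set()))


add_document(1, "ruling mentions secretword here")

blocked = search("secretword")           # no hits: the term never reached the index
found = search("ruling")                 # the document is still findable by other terms
returned_text = stored_fields[found[0]]  # and the stored text still contains the stopword
```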

Are you anonymizing before sending to Solr (if so, why stop-word block?).  If 
not, if you're storing that content it could be returned to the searching 
client.   If you aren't anonymizing before sending to Solr, how are you using 
the stop word filtering to do this?

Erik

On Jan 8, 2012, at 23:08 , Michael Lissner wrote:


I've got them configured at index and query time, so sounds like I'm all set.

I'm doing anonymization of social security numbers, converting them to 
xxx-xx-. I don't *think* users can find a way of identifying these docs if 
the stopwords-based block works.

Thank you both for the confirmation.

Mike

On Sun 08 Jan 2012 09:32:53 PM PST, Gora Mohanty wrote:

On Mon, Jan 9, 2012 at 5:03 AM, Michael Lissner
mliss...@michaeljaylissner.com   wrote:

I have a unique use case where I have words in my corpus that users
shouldn't ever be allowed to search for. My theory is that if I add these to
the stopwords list, that should do the trick.


Yes, that should work. Are you including the stop words at index-time,
query-time, or both? Normally, you should do both.

If done at the time of indexing, these terms will not even be in the
index, so I cannot think of any security issues.

Regards,
Gora




Re: FastVectorHighlighter wiki corrections

2012-01-10 Thread Michael Lissner

Hi,

I didn't hear any responses here, so I went ahead and made a bunch of 
changes to the highlighting parameters wiki:
 - Highlighter is now known as Original Highlighter so it's more clear 
that Highlighter doesn't just refer to the highlighting utilities generally.
 - I need help with fragsize. The wiki says to set it to either 0 or a 
huge number to disable fragmenting. Which is it?
 - the wiki says that hl.useFastVectorHighlighter is defaulted to 
false. I read somewhere that FVH is True when the data has been indexed 
with termVectors, termPositions and termOffsets. Is that correct?


Thanks,

Mike


On 01/07/2012 10:24 PM, Michael Lissner wrote:
I switched over to the FastVectorHighlighter, but I'm struggling with 
the highlighting wiki. For example, it took me a while to figure out 
that "Highlighter only" means that a parameter doesn't work for FVH.


Can somebody wise tell me if the following are valid corrections I can 
make:
 - fragSize=0 can be accomplished in FVH by creating a fragListBuilder 
in your config:
<fragListBuilder name="single" class="solr.highlight.SingleFragListBuilder"/>

   and then calling it with hl.fragListBuilder=single
 - fragListBuilder supports field level overrides (this isn't 
mentioned currently)
 - the wiki says that hl.useFastVectorHighlighter is defaulted to 
false. I read somewhere that FVH is True when the data has been 
indexed with termVectors, termPositions and termOffsets. Is that correct?
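
As a rough sketch of what such a request might look like, here is a small Python snippet that builds the query string. The base URL, the field name text, and the query are assumptions for a local dev setup, and it presumes a fragListBuilder named single is defined in the config as described above. Since the default for hl.useFastVectorHighlighter is the open question, the sketch sets it explicitly.

```python
from urllib.parse import urlencode

# Assumed local dev endpoint and field name; adjust for a real install.
BASE_URL = "http://localhost:8983/solr/select/"

params = {
    "q": "text:contract",                   # hypothetical query
    "hl": "true",
    "hl.fl": "text",
    "hl.useFastVectorHighlighter": "true",  # set explicitly rather than relying on a default
    "hl.fragListBuilder": "single",         # the builder name registered in the config
}

# urlencode joins the parameters with "&" and escapes the ":" in the query.
url = BASE_URL + "?" + urlencode(params)
print(url)
```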


Thanks,

Mike


Missing query operators?

2012-01-09 Thread Michael Lissner

Hi,

I'm setting up a search system that I expect lawyers to use, and I know 
they're demanding about the query operators they want. I've been looking 
around a bit, and while some of these are possible on the backend, I 
can't see how to enable them on the front end since they lack operators:


 - exact match (disabling stemming): Ideally, users need a way of 
turning this on or off for terms in their query (e.g. [ =walking running 
] would stem the word running, but not walking).
 - term quorum: This can be done via the mm parameter, but not as part 
of a query. Making it part of the query increases the flexibility of the 
parameter, since users can make queries like [ (dog cat kitten)/2 AND 
goat ], which could request documents containing two of dog, cat or 
kitten, as well as goat.
 - term order: Requesting that one term come before another doesn't 
seem to be possible, but can be very useful in some cases.
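
To make the semantics I have in mind concrete, here is a small Python sketch of the quorum and ordering checks. The quorum, in_order and matches helpers are hypothetical, this is neither Solr nor Sphinx syntax, and no query parsing is attempted; it only spells out what [ (dog cat kitten)/2 AND goat ] is meant to accept.

```python
# Hypothetical helpers that spell out the requested semantics.

def quorum(terms, minimum, doc_tokens):
    """True when at least `minimum` of `terms` occur in the document."""
    return sum(1 for term in terms if term in doc_tokens) >= minimum


def in_order(first, second, doc_tokens):
    """True when some occurrence of `first` precedes some occurrence of `second`."""
    if first not in doc_tokens or second not in doc_tokens:
        return False
    last_second = len(doc_tokens) - 1 - doc_tokens[::-1].index(second)
    return doc_tokens.index(first) < last_second


def matches(doc_tokens):
    """Equivalent of [ (dog cat kitten)/2 AND goat ]."""
    return quorum(("dog", "cat", "kitten"), 2, doc_tokens) and "goat" in doc_tokens


doc_a = "the dog chased the kitten past a goat".split()
doc_b = "a dog and a goat".split()

print(matches(doc_a))                  # True: dog and kitten satisfy the quorum, goat is present
print(matches(doc_b))                  # False: only one of the three quorum terms appears
print(in_order("dog", "goat", doc_a))  # True: dog comes before goat
```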


These are all possible in the Sphinx search engine, which is what I'm 
coming from, and I don't see feature requests for them in Jira. Would it 
be worth it to put these in, or is there a reason that Solr/Lucene don't 
currently support these functions within queries?


Thanks,

Mike


stopwords as privacy measure

2012-01-08 Thread Michael Lissner
I have a unique use case where I have words in my corpus that users 
shouldn't ever be allowed to search for. My theory is that if I add 
these to the stopwords list, that should do the trick.


I'm using the edismax parser and it seems to be working in my dev 
environment. Is there any risk to this approach or ways to search for a 
stopword?


My alternative approach will be to filter them myself at query time, but 
I'd like to avoid that if stopwords will work.


Thanks,

Mike


Re: stopwords as privacy measure

2012-01-08 Thread Michael Lissner
I've got them configured at index and query time, so sounds like I'm 
all set.


I'm doing anonymization of social security numbers, converting them to 
xxx-xx-. I don't *think* users can find a way of identifying these 
docs if the stopwords-based block works.


Thank you both for the confirmation.

Mike

On Sun 08 Jan 2012 09:32:53 PM PST, Gora Mohanty wrote:

On Mon, Jan 9, 2012 at 5:03 AM, Michael Lissner
mliss...@michaeljaylissner.com  wrote:

I have a unique use case where I have words in my corpus that users
shouldn't ever be allowed to search for. My theory is that if I add these to
the stopwords list, that should do the trick.


Yes, that should work. Are you including the stop words at index-time,
query-time, or both? Normally, you should do both.

If done at the time of indexing, these terms will not even be in the
index, so I cannot think of any security issues.

Regards,
Gora


Synonym configuration not working?

2012-01-06 Thread Michael Lissner

I'm trying to set up some basic synonyms. The one I've been working on is:

us, usa, united states

My understanding is that adding that to the synonym file will allow 
users to search for US, and get back documents containing usa or united 
states. Ditto for if a user puts in usa or united states.


Unfortunately, with this in place, when I do a search, I get the results 
for items that contain all three of the words - it's doing an AND of the 
synonyms rather than an OR.


If I turn on debugging, this is indeed what I see (plus some stemming):
(+DisjunctionMaxQuery(((westCite:us westCite:usa westCite:unit) | 
(text:us text:usa text:unit) | (docketNumber:us docketNumber:usa 
docketNumber:unit) | ((status:us status:usa status:unit)^1.25) | 
(court:us court:usa court:unit) | (lexisCite:us lexisCite:usa 
lexisCite:unit) | ((caseNumber:us caseNumber:usa caseNumber:unit)^1.25) 
| ((caseName:us caseName:usa caseName:unit)^1.5/no_coord


Am I doing something wrong to cause this? My defaultOperator is set to 
AND, but I'd expect the synonym filter to understand that.


Any help?

Thanks,

Mike


Re: Highlighting with prefix queries and maxBooleanClause

2012-01-06 Thread Michael Lissner
I switched over to using FastVectorHighlighting, and the problem with 
maxBooleanClause is resolved. I guess this is at the expense of having 
a larger index (since you have to enable termVectors, termPositions and 
termOffsets), but at least it's working.


Thanks for the help.

Mike

On Tue 03 Jan 2012 11:53:35 AM PST, Chris Hostetter wrote:


: About bumping MaxBooleanQueries. You can certainly
: bump it up, but it's a legitimate question whether the
: user is well served by allowing that pattern as opposed
: to requiring 2 or 3 leading characters. The assumption

i think the root of the issue here is that when executing queries, really
broad prefix queries like q=* generate constant score queries, so really
broad prefix queries are safe to execute.  but (based on his error) it
seems like the highlighter fails loudly and painfully on these otherwise
safe queries.

understandably, part of the reason this happens is that the highlighter
needs to know all the terms that that prefix expands to in order to know
what to highlight, but the fact that it generates an error when
maxBooleanClause is hit seems unfortunate -- maybe there is no way around
it, but i *thought* there were options that could be used related to
highlighting to mitigate these issues, i just couldn't remember what they
are (does the FastVectorHighlighter have these problems? is it only if you
use WeightedSpanTermExtractor?) and hence my suggestion to Michael to
start a thread here in the hopes that the highlighting experts (Yeah Koji!
... better you than me!) would chime in.


-Hoss


Re: Highlighting with prefix queries and maxBooleanClause

2012-01-01 Thread Michael Lissner

On 01/01/2012 07:48 AM, Erick Erickson wrote:

This may be the impetus for Hoss creating SOLR-2996.

Yep, it is indeed, though I believe this problem can also happen when a 
user searches for something like q=a* in a big index. I need a bigger 
index to know for sure about that, but from what I've read so far, I'm 
fairly certain that this problem is bigger than just the q=* search.


I think my solution when this error is thrown is going to be to bump the 
size of the maxBooleanClause and retry the query. Failing that, I'll 
have to retry the query with highlighting off.

I suspect this will go away if you use the correct
match-all-docs syntax, i.e. q=*:* rather than q=*

It does, yes.

But I'm not sure what highlighting will do when there's
nothing to highlight on (ie, no query terms to match
against your text field).

I believe it does nothing, thankfully.

Mike


Highlighting with prefix queries and maxBooleanClause

2011-12-30 Thread Michael Lissner

This question has come up a few times, but I've yet to see a good solution.

Basically, if I have highlighting turned on and do a query for q=*, I 
get an error that maxBooleanClauses has been exceeded. Granted, this is 
a silly query, but a user might do something similar. My expectation is 
that queries that work when highlighting is OFF should continue working 
when it is ON.


What's the best solution for queries like this? Is it simply to catch 
the error and then up maxBooleanClauses? Or to turn off highlighting 
when this error occurs?


Or am I doing something altogether wrong?

This is the query I'm using to cause the error:

http://localhost:8983/solr/select/?q=*&start=0&rows=20&hl=true&hl.fl=text


Changing hl to false makes the query go through.
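
The second option, turning highlighting off when the error occurs, could be sketched roughly like this. run_query is a hypothetical callable standing in for whatever client actually sends the request, and it is assumed to raise an exception when Solr reports an error; the fake client below exists only to show the flow.

```python
def search_with_highlight_fallback(run_query, params):
    """Try the query with highlighting; on failure, retry once without it."""
    try:
        return run_query({**params, "hl": "true"})
    except Exception:
        # Assumption: the failure came from the highlighter (e.g. TooManyClauses).
        # A real client should inspect the error before deciding to retry.
        return run_query({**params, "hl": "false"})


# Fake client for demonstration: fails whenever highlighting is on.
calls = []


def fake_run_query(params):
    calls.append(params["hl"])
    if params["hl"] == "true":
        raise RuntimeError("maxClauseCount is set to 1024")
    return {"numFound": 3}


result = search_with_highlight_fallback(fake_run_query, {"q": "*"})
print(result)  # {'numFound': 3}
print(calls)   # ['true', 'false']
```

This does not cover raising maxBooleanClauses, which is a server-side setting rather than something a single request can change.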

I'm using Solr 4.0.0-dev

The traceback is:

SEVERE: org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024
at org.apache.lucene.search.ScoringRewrite$1.checkMaxClauseCount(ScoringRewrite.java:68)
at org.apache.lucene.search.ScoringRewrite$ParallelArraysTermCollector.collect(ScoringRewrite.java:159)
at org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:81)
at org.apache.lucene.search.ScoringRewrite.rewrite(ScoringRewrite.java:114)
at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:312)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:155)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:144)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:384)
at org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:216)
at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:184)
at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:205)
at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:511)
at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:402)
at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:121)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1478)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)


Thanks,

Mike