Re: Synonym configuration not working?
Just replying for others in the future. The answer to this is to apply synonyms at index time, not at query time.

Mike

On Fri 06 Jan 2012 02:35:23 PM PST, Michael Lissner wrote:
> I'm trying to set up some basic synonyms. The one I've been working on is:
>
> us, usa, united states
>
> My understanding is that adding that to the synonym file will allow users to search for US and get back documents containing usa or united states, and likewise if a user puts in usa or united states. Unfortunately, with this in place, when I do a search I get results only for items that contain all three of the words: it's doing an AND of the synonyms rather than an OR. If I turn on debugging, this is indeed what I see (plus some stemming):
>
> (+DisjunctionMaxQuery(((westCite:us westCite:usa westCite:unit) |
> (text:us text:usa text:unit) |
> (docketNumber:us docketNumber:usa docketNumber:unit) |
> ((status:us status:usa status:unit)^1.25) |
> (court:us court:usa court:unit) |
> (lexisCite:us lexisCite:usa lexisCite:unit) |
> ((caseNumber:us caseNumber:usa caseNumber:unit)^1.25) |
> ((caseName:us caseName:usa caseName:unit)^1.5/no_coord
>
> Am I doing something wrong to cause this? My defaultOperator is set to AND, but I'd expect the synonym filter to understand that. Any help?
>
> Thanks,
>
> Mike
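For anyone landing here later, an index-time-only synonym setup looks roughly like the sketch below in schema.xml. The field type name and file name are illustrative, not taken from Mike's config; the key point is that the SynonymFilterFactory appears only in the index-time analyzer chain.

```xml
<!-- schema.xml sketch: synonyms applied at index time only.
     Field type name and synonyms.txt contents are illustrative. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- expand="true" indexes every synonym in the group, so a
         query for any one of them matches. -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- No synonym filter here: query-time expansion is what
         produced the AND-of-synonyms behavior above. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With synonyms.txt containing a line such as `us, usa, united states`.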
Re: stopwords as privacy measure
It's a bit of a privacy-through-obscurity measure, unfortunately. The problem is that American courts do a lousy job of removing social security numbers from cases that I put on my site. I do anonymization before sending the cases to Solr, but if you're clever (and the stopwords weren't in place) you could search for evidence of my anonymization efforts and then backtrack to the original cases at the court sites, where you'd find the SSNs... It's a boondoggle, but the stopwords should help.

Mike

On Mon 09 Jan 2012 04:30:22 AM PST, Erik Hatcher wrote:
> Mike -
>
> Indeed, users won't be able to *search* for things removed by the stop filter at index time (the terms literally aren't in the index then). But be careful with the stored value. Analysis does not affect stored content. Are you anonymizing before sending to Solr (and if so, why the stop-word block?). If not, and you're storing that content, it could be returned to the searching client. If you aren't anonymizing before sending to Solr, how are you using the stop word filtering to do this?
>
> Erik
>
> On Jan 8, 2012, at 23:08, Michael Lissner wrote:
>> I've got them configured at index and query time, so it sounds like I'm all set. I'm doing anonymization of social security numbers, converting them to xxx-xx-. I don't *think* users can find a way of identifying these docs if the stopwords-based block works.
>>
>> Thank you both for the confirmation.
>>
>> Mike
>>
>> On Sun 08 Jan 2012 09:32:53 PM PST, Gora Mohanty wrote:
>>> On Mon, Jan 9, 2012 at 5:03 AM, Michael Lissner mliss...@michaeljaylissner.com wrote:
>>>> I have a unique use case where I have words in my corpus that users shouldn't ever be allowed to search for. My theory is that if I add these to the stopwords list, that should do the trick.
>>>
>>> Yes, that should work. Are you including the stop words at index-time, query-time, or both? Normally, you should do both. If done at the time of indexing, these terms will not even be in the index, so I cannot think of any security issues.
>>>
>>> Regards,
>>> Gora
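The pre-indexing anonymization step Mike describes can be sketched as below. This is an illustration only, not his actual code: the regex, the function name, and the `xxx-xx-xxxx` replacement token are all assumptions (the email only says the numbers become "xxx-xx-").

```python
import re

# Assumption: SSNs appear in the canonical 123-45-6789 form.
# Real court documents may use other layouts (spaces, no dashes).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def anonymize_ssns(text: str, replacement: str = "xxx-xx-xxxx") -> str:
    """Replace anything that looks like an SSN before the document is
    sent to Solr, so the number never reaches the index or the stored
    field -- which is the caveat Erik raises about stored content."""
    return SSN_RE.sub(replacement, text)
```

Scrubbing before indexing means the stop-word block only has to guard against searches for the replacement token itself.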
Re: FastVectorHighlighter wiki corrections
Hi,

I didn't hear any responses here, so I went ahead and made a bunch of changes to the highlighting parameters wiki:

- "Highlighter" is now known as "Original Highlighter", to make it clearer that "Highlighter" doesn't just refer to the highlighting utilities generally.
- I need help with fragsize. The wiki says to set it to either 0 or a huge number to disable fragmenting. Which is it?
- The wiki says that hl.useFastVectorHighlighter defaults to false. I read somewhere that FVH defaults to true when the data has been indexed with termVectors, termPositions and termOffsets. Is that correct?

Thanks,

Mike

On 01/07/2012 10:24 PM, Michael Lissner wrote:
> I switched over to the FastVectorHighlighter, but I'm struggling with the highlighting wiki. For example, it took me a while to figure out that "Highlighter" there only means that a parameter doesn't work for FVH. Can somebody wise tell me if the following are valid corrections I can make:
>
> - fragSize=0 can be accomplished in FVH by creating a fragListBuilder in your config:
>
>   <fragListBuilder name="single" class="solr.highlight.SingleFragListBuilder"/>
>
>   and then calling it with hl.fragListBuilder=single
> - fragListBuilder supports field-level overrides (this isn't mentioned currently)
> - The wiki says that hl.useFastVectorHighlighter defaults to false. I read somewhere that FVH defaults to true when the data has been indexed with termVectors, termPositions and termOffsets. Is that correct?
>
> Thanks,
>
> Mike
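For anyone trying the single-fragment trick above: the fragListBuilder is registered in the highlighting section of solrconfig.xml and then selected per request, either globally or per field. A rough sketch (the `text` field name is illustrative):

```xml
<!-- solrconfig.xml, inside the highlighting configuration:
     register a fragListBuilder that returns the whole field as
     one fragment, the FVH analogue of hl.fragsize=0. -->
<fragListBuilder name="single"
                 class="solr.highlight.SingleFragListBuilder"/>
```

It can then be requested with `hl.fragListBuilder=single`, or for a single field via the per-field override form, e.g. `f.text.hl.fragListBuilder=single`.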
Missing query operators?
Hi,

I'm setting up a search system that I expect lawyers to use, and I know they're demanding about the query operators they want. I've been looking around a bit, and while some of these are possible on the backend, I can't see how to enable them on the front end, since they lack operators:

- Exact match (disabling stemming): ideally, users need a way of turning this on or off per term in their query (e.g. [ =walking running ] would stem the word running, but not walking).
- Term quorum: this can be done via the mm parameter, but not as part of a query. Making it part of the query increases the flexibility of the parameter, since users can make queries like [ (dog cat kitten)/2 AND goat ], which could request documents containing two of dog, cat or kitten, as well as goat.
- Term order: requesting that one term come before another doesn't seem to be possible, but can be very useful in some cases.

These are all possible in the Sphinx search engine, which is what I'm coming from, and I don't see feature requests for them in Jira. Would it be worth it to put these in, or is there a reason that Solr/Lucene don't currently support these functions within queries?

Thanks,

Mike
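For comparison, the closest existing knob for the quorum case is edismax's request-wide mm (minimum-should-match) parameter, which applies to the whole query rather than to one parenthesized clause. A sketch of such a request (host, core, and terms are made up):

```
http://localhost:8983/solr/select?defType=edismax&q=dog%20cat%20kitten&mm=2
```

This asks that at least two of the three terms match, but there is no way to scope it to a sub-clause the way the Sphinx-style [ (dog cat kitten)/2 AND goat ] syntax would.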
stopwords as privacy measure
I have a unique use case where I have words in my corpus that users shouldn't ever be allowed to search for. My theory is that if I add these to the stopwords list, that should do the trick. I'm using the edismax parser, and it seems to be working in my dev environment.

Is there any risk to this approach, or any way a user could still search for a stopword? My alternative approach will be to filter the terms myself at query time, but I'd like to avoid that if stopwords will work.

Thanks,

Mike
Re: stopwords as privacy measure
I've got them configured at index and query time, so it sounds like I'm all set. I'm doing anonymization of social security numbers, converting them to xxx-xx-. I don't *think* users can find a way of identifying these docs if the stopwords-based block works.

Thank you both for the confirmation.

Mike

On Sun 08 Jan 2012 09:32:53 PM PST, Gora Mohanty wrote:
> On Mon, Jan 9, 2012 at 5:03 AM, Michael Lissner mliss...@michaeljaylissner.com wrote:
>> I have a unique use case where I have words in my corpus that users shouldn't ever be allowed to search for. My theory is that if I add these to the stopwords list, that should do the trick.
>
> Yes, that should work. Are you including the stop words at index-time, query-time, or both? Normally, you should do both. If done at the time of indexing, these terms will not even be in the index, so I cannot think of any security issues.
>
> Regards,
> Gora
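The "index and query time" configuration mentioned here looks roughly like the schema.xml sketch below. The field type name and stopword file name are illustrative; the point is that the same StopFilterFactory appears in both analyzer chains.

```xml
<!-- schema.xml sketch: blocked terms stripped both going into the
     index and out of user queries. Names are illustrative. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Terms in stopwords.txt never reach the index... -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- ...and are dropped from queries as well. -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true"/>
  </analyzer>
</fieldType>
```

As Erik notes above, this does nothing for stored content, which is why anonymizing before indexing is still needed.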
Synonym configuration not working?
I'm trying to set up some basic synonyms. The one I've been working on is:

us, usa, united states

My understanding is that adding that to the synonym file will allow users to search for US and get back documents containing usa or united states, and likewise if a user puts in usa or united states. Unfortunately, with this in place, when I do a search I get results only for items that contain all three of the words: it's doing an AND of the synonyms rather than an OR. If I turn on debugging, this is indeed what I see (plus some stemming):

(+DisjunctionMaxQuery(((westCite:us westCite:usa westCite:unit) |
(text:us text:usa text:unit) |
(docketNumber:us docketNumber:usa docketNumber:unit) |
((status:us status:usa status:unit)^1.25) |
(court:us court:usa court:unit) |
(lexisCite:us lexisCite:usa lexisCite:unit) |
((caseNumber:us caseNumber:usa caseNumber:unit)^1.25) |
((caseName:us caseName:usa caseName:unit)^1.5/no_coord

Am I doing something wrong to cause this? My defaultOperator is set to AND, but I'd expect the synonym filter to understand that. Any help?

Thanks,

Mike
Re: Highlighting with prefix queries and maxBooleanClause
I switched over to using FastVectorHighlighting, and the problem with maxBooleanClause is resolved. I guess this comes at the expense of a larger index (since you have to enable termVectors, termPositions and termOffsets), but at least it's working.

Thanks for the help.

Mike

On Tue 03 Jan 2012 11:53:35 AM PST, Chris Hostetter wrote:
> : About bumping MaxBooleanQueries. You can certainly
> : bump it up, but it's a legitimate question whether the
> : user is well served by allowing that pattern as opposed
> : to requiring 2 or 3 leading characters.
>
> I think the root of the issue here is that, when executing queries, really broad prefix queries like q=* generate constant-score queries, so really broad prefix queries are safe to execute. But (based on his error) it seems like the highlighter fails loudly and painfully on these otherwise safe queries.
>
> Understandably, part of the reason this happens is that the highlighter needs to know all the terms that the prefix expands to in order to know what to highlight, but the fact that it generates an error when maxBooleanClause is hit seems unfortunate. Maybe there is no way around it, but I *thought* there were options related to highlighting that could mitigate these issues; I just couldn't remember what they are (does the FastVectorHighlighter have these problems? Is it only if you use WeightedSpanTermExtractor?). Hence my suggestion to Michael to start a thread here in the hopes that the highlighting experts (Yeah Koji! ... better you than me!) would chime in.
>
> -Hoss
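The index-size cost comes from the extra per-field data FVH requires. The field definition ends up looking roughly like this sketch (field and type names are illustrative, not from Mike's schema):

```xml
<!-- schema.xml sketch: FVH needs all three term* attributes on
     each field you want to highlight, which is what grows the
     index. Field name and type are illustrative. -->
<field name="text" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```

Fields changed this way need to be re-indexed before FVH can highlight them.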
Re: Highlighting with prefix queries and maxBooleanClause
On 01/01/2012 07:48 AM, Erick Erickson wrote:
> This may be the impetus for Hoss creating SOLR-2996.

Yep, it is indeed, though I believe this problem can also happen when a user searches for something like q=a* in a big index. I need a bigger index to know for sure about that, but from what I've read so far, I'm fairly certain that this problem is bigger than just the q=* search. I think my solution when this error is thrown is going to be to bump the size of maxBooleanClause and retry the query. Failing that, I'll have to retry the query with highlighting off.

> I suspect this will go away if you use the correct match-all-docs syntax, i.e. q=*:* rather than q=*

It does, yes. But I'm not sure what highlighting will do when there's nothing to highlight on (i.e., no query terms to match against your text field). I believe it does nothing, thankfully.

Mike
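The fallback strategy described above (retry with highlighting off when the clause limit is hit) can be sketched as follows. `do_search` and `SolrError` are stand-ins for whatever client library is in use, not real APIs:

```python
class SolrError(Exception):
    """Stand-in for the error type a real Solr client would raise."""

def search_with_fallback(do_search, params):
    """Run a query via do_search (a callable taking a dict of query
    params; a stand-in for a real Solr client call). If the
    highlighter blows the clause limit, retry once with
    highlighting turned off."""
    try:
        return do_search(params)
    except SolrError as e:
        # Solr reports the highlighter failure as a TooManyClauses
        # error whose message includes "maxClauseCount is set to N".
        if "maxClauseCount" in str(e):
            return do_search(dict(params, hl="false"))
        raise
```

Bumping maxBooleanClauses and retrying, as Mike first suggests, would slot in the same way, with a config change instead of the `hl` override.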
Highlighting with prefix queries and maxBooleanClause
This question has come up a few times, but I've yet to see a good solution. Basically, if I have highlighting turned on and do a query for q=*, I get an error that maxBooleanClauses has been exceeded. Granted, this is a silly query, but a user might do something similar. My expectation is that queries that work when highlighting is OFF should continue working when it is ON.

What's the best solution for queries like this? Is it simply to catch the error and then up maxBooleanClauses? Or to turn off highlighting when this error occurs? Or am I doing something altogether wrong?

This is the query I'm using to cause the error:

http://localhost:8983/solr/select/?q=*&start=0&rows=20&hl=true&hl.fl=text

Changing hl to false makes the query go through. I'm using Solr 4.0.0-dev.

The traceback is:

SEVERE: org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024
        at org.apache.lucene.search.ScoringRewrite$1.checkMaxClauseCount(ScoringRewrite.java:68)
        at org.apache.lucene.search.ScoringRewrite$ParallelArraysTermCollector.collect(ScoringRewrite.java:159)
        at org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:81)
        at org.apache.lucene.search.ScoringRewrite.rewrite(ScoringRewrite.java:114)
        at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:312)
        at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:155)
        at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:144)
        at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:384)
        at org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:216)
        at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:184)
        at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:205)
        at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:511)
        at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:402)
        at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:121)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1478)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

Thanks,

Mike
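For anyone hitting the same error: the limit named in the exception message is the maxBooleanClauses setting in solrconfig.xml (1024 is the stock default shown below; raising it lets broader prefix expansions through at the cost of more memory and CPU per query):

```xml
<!-- solrconfig.xml: the limit behind the TooManyClauses error
     above. 1024 is the stock default. -->
<maxBooleanClauses>1024</maxBooleanClauses>
```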