Re: Query time boosting with dismax
You can actually define boost queries to do that (the bq parameter). Boost queries accept the standard Lucene query syntax and are eventually appended to the user query. Just make sure that the default operator is set to OR; otherwise these boost queries will not only influence the boosts but also filter out some of the results.

Otis Gospodnetic wrote:
Terms no, but fields (with terms) and phrases, yes.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

----- Original Message -----
From: Girish Redekar girish.rede...@aplopio.com
To: solr-user@lucene.apache.org
Sent: Fri, December 4, 2009 11:42:16 PM
Subject: Query time boosting with dismax

Hi,
Is it possible to weigh specific query terms with a Dismax query parser? Is it possible to write queries of the sort ... field1:(term1)^2.0 + (term2^3.0) ... with dismax?

Thanks,
Girish Redekar
http://girishredekar.net
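For illustration, a dismax request using bq to weight specific terms might look like the following (host, core path, and field names are assumptions, not from the thread; the bq clauses only boost matching documents as long as the default operator is OR):

```text
http://localhost:8983/solr/select?defType=dismax
    &q=term1 term2
    &qf=field1 field2
    &bq=field1:term1^2.0 field1:term2^3.0
```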
Embedded for write, HTTP for read - cache aging
Hello,

Does anyone know of a way to tell an HTTP SolrServer to reload its back-end index (mark its cache as dirty) periodically? I have a scenario where an EmbeddedSolrServer is used for writing (for fast indexing) and a CommonsHttpSolrServer is used for reading (for remote access). If the HTTP server is used for writing, reading clients pick up any updates, as the /update has gone 'through' the HTTP server. For very high indexing rates, I'd rather not have to build an HTTP request for every write (or group of writes), since the writer is always on the same machine as the index.

Any help on this is much appreciated.

Thanks,
Peter
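One workaround (a sketch, not something proposed in the thread) is to periodically post an empty explicit commit to the HTTP Solr instance, which forces it to reopen its searcher and pick up segments written by the in-process embedded writer. The helper below only builds the request pieces; the host, port, and update path are assumptions:

```python
def commit_request(host="localhost", port=8983, path="/solr/update"):
    """Build the URL, body, and headers for an explicit <commit/> request.

    Posting this to the read-only HTTP Solr instance makes it reopen
    its searcher, so reading clients see updates made by the
    EmbeddedSolrServer that shares the same index directory.
    """
    url = "http://%s:%d%s" % (host, port, path)
    body = "<commit/>"
    headers = {"Content-Type": "text/xml"}
    return url, body, headers

url, body, headers = commit_request()
print(url)   # http://localhost:8983/solr/update
print(body)  # <commit/>
```

In practice one would POST this on a timer (e.g. with urllib.request) at whatever staleness interval the reading clients can tolerate.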
Re: Sanity check on numeric types and which of them to use
And what about:

<fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true"/>
vs.
<fieldType name="bcdint" class="solr.BCDIntField" sortMissingLast="true"/>

What is the difference between the two? Is bcdint always better?

Thanks in advance

Yonik Seeley-2 wrote:
On Fri, Dec 4, 2009 at 7:38 PM, Jay Hill jayallenh...@gmail.com wrote:
1) Is there any benefit to using the int type as a TrieIntField w/ precisionStep=0 over the pint type for simple ints that won't be sorted or range queried?

No. But given that people could throw in a random range query and have it work correctly with a trie-based int (vs. a plain int), that seems reason enough to prefer it.

2) In 1.4, what type is now most efficient for sorting?

trie and plain should be pretty equivalent (trie might be slightly faster to uninvert the first time). Both take up less memory in the field cache than sint.

3) The only reason to use a sint field is for backward compatibility and/or to use sortMissingFirst/sortMissingLast, correct?

I believe so.

-Yonik
http://www.lucidimagination.com
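For reference, the field types being compared might be declared in schema.xml roughly like this (a sketch modeled on the Solr 1.4 example schema; the exact names and precisionStep values are illustrative):

```xml
<!-- legacy sortable int: string-based, supports sortMissingFirst/Last -->
<fieldType name="sint" class="solr.SortableIntField"
           sortMissingLast="true" omitNorms="true"/>

<!-- trie int with precisionStep="0": compact, sortable, exact queries -->
<fieldType name="int" class="solr.TrieIntField"
           precisionStep="0" omitNorms="true" positionIncrementGap="0"/>

<!-- trie int with a nonzero precisionStep: indexes extra terms
     per value to speed up range queries -->
<fieldType name="tint" class="solr.TrieIntField"
           precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
```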
Re: Sanity check on numeric types and which of them to use
On Sat, Dec 5, 2009 at 7:02 AM, Marc Sturlese marc.sturl...@gmail.com wrote:
And what about sint vs. bcdint? What is the difference between the two? Is bcdint always better?

BCDInt was a very early attempt at a sortable int type that didn't go through binary - it went directly from base 10 (the actual string representation) to a sortable base 10000 (10K fits in a single char and saves memory in the FieldCache), and it also had no size limit. It's no longer referenced in any example schemas, and it doesn't have support for function queries.

-Yonik
http://www.lucidimagination.com
Re: Query time boosting with dismax
Are you sure about the default operator and bq? I assume we're talking about the defaultOperator setting in schema.xml. I think boost queries are OR'd into the main query automatically. From DismaxQParser#addBoostQuery():

    query.add(f, BooleanClause.Occur.SHOULD);

There is one case where query.add((BooleanClause) c); is used, though.

Erik
Re: Query time boosting with dismax
Well, this is mainly based on some experiments I did (not based on the code base). It appeared as if the boost queries were appended to the generated dismax query, and if the default operator (in the schema) was set to AND, it actually filtered the results. For example, here's a dismax config:

<requestHandler name="dismax" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">text^0.5 name^1.0 category^1.2</str>
    <str name="bq">category:Audio name:black</str>
    <str name="fl">*,score</str>
    ...
  </lst>
</requestHandler>

When searching with a default OR operator, you receive more results than with an AND operator. Checking the generated query using debugQuery=true reveals the following.

Generated query with default OR operator:

+DisjunctionMaxQuery((category:black^1.2 | text:black^0.5 | name:black)~0.01) DisjunctionMaxQuery((category:black^1.5 | text:black^0.5 | name:black^1.2)~0.01) category:Audio name:black FunctionQuery((product(sint(rating),const(-1.0)))^0.5)

Generated query with default AND operator:

+DisjunctionMaxQuery((category:black^1.2 | text:black^0.5 | name:black)~0.01) DisjunctionMaxQuery((category:black^1.5 | text:black^0.5 | name:black^1.2)~0.01) +category:Audio +name:black FunctionQuery((product(sint(rating),const(-1.0)))^0.5)

So when the default is AND, both bq clauses are marked as MUST in the overall query, which in turn filters the results. I would instead expect these queries to be added as SHOULD, in which case the generated query would look like:

+DisjunctionMaxQuery((category:black^1.2 | text:black^0.5 | name:black)~0.01) DisjunctionMaxQuery((category:black^1.5 | text:black^0.5 | name:black^1.2)~0.01) (+category:Audio +name:black) FunctionQuery((product(sint(rating),const(-1.0)))^0.5)

Cheers,
Uri
Re: Query time boosting with dismax
Checking it further by looking at the code, it seems that in most cases it indeed adds the boost queries as SHOULD. But if you define *one* bq parameter which contains a boolean query, then each clause in this boolean query is added to the query as is. Therefore:

This setup will filter the query:

<str name="bq">+category:Audio +name:black</str>

This setup will *not* filter the query:

<str name="bq">+category:Audio</str>
<str name="bq">+name:black</str>

So in the first setup, the default operator as defined in the schema plays a role.

Cheers,
Uri
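The behavior described above can be sketched as follows. This is a simplified toy model (assumed behavior, not the actual DismaxQParser Java code): a single bq that parses to a multi-clause boolean query has its clauses copied into the main query unchanged, while otherwise each bq is added as an optional (SHOULD) clause that only influences scoring.

```python
# Toy model of Lucene boolean clauses as (occur, query_text) pairs;
# "MUST" corresponds to a '+' prefix or an AND default operator.

def add_boost_queries(main_clauses, boost_queries):
    """Sketch of DismaxQParser#addBoostQuery clause handling.

    main_clauses: list of (occur, text) tuples for the user query.
    boost_queries: list of parsed bq params, each a list of clauses.
    """
    result = list(main_clauses)
    if len(boost_queries) == 1 and len(boost_queries[0]) > 1:
        # A single bq that is itself a boolean query: its clauses are
        # copied in as-is, so MUST clauses filter the result set.
        result.extend(boost_queries[0])
    else:
        # Otherwise each bq becomes an optional clause: boost only.
        for bq in boost_queries:
            for occur, text in bq:
                result.append(("SHOULD", text))
    return result

main = [("MUST", "dismax(black)")]
single = add_boost_queries(main, [[("MUST", "category:Audio"),
                                   ("MUST", "name:black")]])
multi = add_boost_queries(main, [[("MUST", "category:Audio")],
                                 [("MUST", "name:black")]])
print(single)  # MUST clauses survive -> filters results
print(multi)   # added as SHOULD -> influences score only
```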
Re: Solr 1.4: StringIndexOutOfBoundsException in SpellCheckComponent with HTMLStripCharFilterFactory
Robin Wojciki wrote:
I am running a search in Solr 1.4 and I am getting the StringIndexOutOfBoundsException pasted below. The spellcheck field uses HTMLStripCharFilterFactory; the search works fine if I do not use the HTMLStripCharFilterFactory. If I set a breakpoint at SpellCheckComponent.java:248, the value of the variable best is as shown in the screenshot: http://yfrog.com/j5solrdebuginspectp

At the end of the first iteration, offset = 5 - (24 - 0) = -19. This causes the index-out-of-bounds exception.

The spellcheck field is defined as:

<fieldType name="text_spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Stack trace:

String index out of range: -19
java.lang.StringIndexOutOfBoundsException: String index out of range: -19
  at java.lang.AbstractStringBuilder.replace(Unknown Source)
  at java.lang.StringBuilder.replace(Unknown Source)
  at org.apache.solr.handler.component.SpellCheckComponent.toNamedList(SpellCheckComponent.java:248)
  at org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:143)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
  at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
  at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
  at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
  at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
  at org.mortbay.jetty.Server.handle(Server.java:285)
  at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
  at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
  at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
  at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
  at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
  at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
  at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

I couldn't reproduce it with simple test data. Can you open a JIRA issue and attach a test case that reproduces the problem, along with the spellchecker definition from solrconfig.xml?

Koji
--
http://www.rondhuit.com/en/
Re: solr 1.4: multi-select for statscomponent
Is there any update on this requirement?

Britske wrote:
Is there a way to exclude filters from a stats field, like it is possible to exclude filters from a facet.field? It didn't work for me. I.e.: I have a field price, and although I filter on price, I would like to be able to get the entire range (min, max) of prices as if I didn't specify the filter. Obviously, without excluding the filter, the (min, max) range is constrained to [50,100].

Part of the query:

stats=true&stats.field={!ex=p1}price&fq={!tag=p1}price:[50 TO 100]

USE-CASE: I show a double slider using JavaScript to display possible prices (2 handles, one allowing the user to set the min price and the other the max price). The slider has a range of [0, maxprice without price filter set]; maxprice is inserted by getting info from stats.price with stats=true. When the user sets the slider, a filter (fq) is set, constraining the result set to the selected min and max prices. After the page updates, I still want to show the price slider with the min and max handles set to the prices selected by the user, so the user can alter his filter quickly. However (and here it comes), I would also like to get the 'maxprice without price filter set', because I need it to set the max range of the slider.

Is there any (undocumented) feature that makes this possible? If not, would it be easy to add?

Thanks,
Britske
Re: HTML Stripping slower in Solr 1.4?
Yonik Seeley wrote:
Is BaseCharFilter required for the html strip filter?
-Yonik
http://www.lucidimagination.com

It could be, if HTMLStripCharFilter is reverted to the first version. With the first version of HTMLStripCharFilter, for example, if we have <p>aaa, it produces "   aaa" (3 space chars prior to aaa). But after SOLR-1394 was committed, it produces " aaa" (1 space), and it now uses the correct() method of BaseCharFilter to correct offsets.

Koji
--
http://www.rondhuit.com/en/
Re: Retrieving large num of docs
Hi Otis,

I think my experiments are not conclusive about the reduction in search time. I was playing around with various configurations to reduce the time to retrieve documents from Solr. I am sure that after making the two multi-valued text fields un-stored (instead of stored), retrieval time (query time + time to load the stored fields) became very fast. I was expecting the enableLazyFieldLoading setting in solrconfig to take care of this, but apparently it is not working as expected.

Out of curiosity, I removed these 2 fields from the index (this time I am not even indexing them) and my search time got better (10 times better). However, I am still trying to isolate the reason for the search-time reduction. It may be because there are 2 fewer fields to search in, or because of the reduction in index size, or maybe something else. I am not sure if enableLazyFieldLoading plays any part in explaining this.

- Raghu

On Fri, Dec 4, 2009 at 3:07 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:
Hm, hm, interesting. I was looking into something like this the other day (BIG indexed+stored text fields). After seeing enableLazyFieldLoading=true in solrconfig, and after seeing that fl didn't include those big fields, I thought: hm, so Lucene/Solr will not be pulling those large fields from disk, OK. You are saying that this may not be true based on your experiment? And what I'm calling your experiment means that you reindexed the same data, but without the 2 multi-valued text fields... and that was the only change you made, and you got cca 10x search performance improvement? Sorry for repeating your words, just trying to confirm and understand.
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

----- Original Message -----
From: Raghuveer Kancherla raghuveer.kanche...@aplopio.com
To: solr-user@lucene.apache.org
Sent: Thu, December 3, 2009 8:43:16 AM
Subject: Re: Retrieving large num of docs

Hi Hoss,

I was experimenting with various queries to solve this problem, and in one such test I remember that requesting only the ID did not change the retrieval time. To be sure, I tested it again using the curl command today, and it confirms my previous observation. Also, enableLazyFieldLoading is set to true in my solrconfig.

Another general observation (off topic) is that having a moderately large multi-valued text field (~200 entries) in the index seems to slow down the search significantly. I removed the 2 multi-valued text fields from my index and my search got ~10 times faster. :)

- Raghu

On Thu, Dec 3, 2009 at 2:14 AM, Chris Hostetter wrote:
: I think I solved the problem of retrieving 300 docs per request for now. The
: problem was that I was storing 2 moderately large multivalued text fields
: though I was not retrieving them during search time. I reindexed all my
: data without storing these fields. Now the response time (time for Solr to
: return the http response) is very close to the QTime Solr is showing in the

Hmmm, two comments:

1) the example URL from your previous mail...

: http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python

...doesn't match your earlier statement that you are only returning the id field (there is no fl param in that URL) ... are you certain you weren't returning those large stored fields in the response?

2) assuming you were actually using an fl param to limit the fields, make sure you have this setting in your solrconfig.xml...

<enableLazyFieldLoading>true</enableLazyFieldLoading>

...that should make it pretty fast to return only a few fields of each document, even if you do have some jumbo stored fields that aren't being returned.

-Hoss
Re: WELCOME to solr-user@lucene.apache.org
2 ways I can think of:

- ExtractingRequestHandler (this is what I am guessing you are using now): set extractOnly=true while making a request to the ExtractingRequestHandler and get the parsed content back. Then make a post request to the update request handler with whatever fields and field values you want.
- Use HTMLStripWhitespaceTokenizerFactory. This article may help explain what I mean: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripWhitespaceTokenizerFactory

- Raghu

On Sat, Dec 5, 2009 at 3:44 AM, khalid y kern...@gmail.com wrote:
Hi,
I have a problem with Solr. I'm indexing some HTML content and Solr crashes because my id field is multi-valued. I found that Tika reads the HTML and extracts metadata like <meta name="id" content="12"> from my HTML files, but my documents already have an id set by literal.id=10. I tried to map the id from Tika with fmap.id=ignored_, but it also ignores my literal.id. I'm using Solr 1.4 and Tika 0.5. Can someone explain to me how I can ignore this Tika id metadata?

Thanks
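As a sketch of the first suggestion (the core URL and the "text" field name are assumptions, and no HTTP call is actually made here): first ask the ExtractingRequestHandler to parse only, then post the content yourself through a plain update, so the id extracted by Tika from <meta name="id" ...> never enters the document:

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape

def extract_only_url(base="http://localhost:8983/solr/update/extract"):
    """URL asking ExtractingRequestHandler (Tika) to parse the posted
    document and return the content instead of indexing it."""
    return base + "?" + urlencode({"extractOnly": "true"})

def add_doc_xml(doc_id, text):
    """Body for a plain /update request: we control every field, so
    Tika's extracted metadata id is simply never included."""
    return ('<add><doc>'
            '<field name="id">%s</field>'
            '<field name="text">%s</field>'
            '</doc></add>' % (doc_id, escape(text)))

print(extract_only_url())
print(add_doc_xml(10, "parsed content from the extract step"))
```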
Re: Document Decay
On Dec 4, 2009, at 1:56 AM, brad anderson wrote:
Hi, I'm looking for a way to have the score of documents decay over time. I want older documents to have a lower score than newer documents. I noted the ReciprocalFloatFunction class. In an example it seemed to be doing just this when you set the function to be:

recip(ms(NOW,mydatefield),3.16e-11,1,1)

This is supposed to degrade the score to half its value if mydatefield is 1 year older than the current date. My question is: does it make the document score go down to 0.5, or does it make the document score 1/2 of its original value? I.e., if the document has score 0.8, will the score be 0.4 or 0.5 after using this function?

Actually, the value of the function gets added (it can be multiplied, too, with other params) to the score for the document. You can see this by adding debugQuery=true to your request, which allows you to examine the explains.

Also, are there better alternatives to deal with document decay?

Some people like a different decay that does something like: today is better than yesterday, yesterday is better than last week, and last week is better than last month, etc. (in a non-linear way). To do this, you would need to implement your own function, I think.

--
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
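The reciprocal function Solr evaluates here is recip(x,m,a,b) = a/(m*x + b), so the example yields roughly 1.0 for a brand-new document and about 0.5 for one a year old - a value that is then added to (or multiplied into) the score rather than replacing it. A quick numerical check:

```python
def recip(x, m, a, b):
    """Solr's recip() function: a / (m*x + b)."""
    return a / (m * x + b)

MS_PER_YEAR = 365 * 24 * 3600 * 1000  # ms(NOW, mydatefield) one year back

now_value = recip(0, 3.16e-11, 1, 1)             # document dated NOW
year_value = recip(MS_PER_YEAR, 3.16e-11, 1, 1)  # one year old
print(now_value)   # 1.0
print(year_value)  # ~0.5
```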
Re: WELCOME to solr-user@lucene.apache.org
Thanks a lot for your response!

For the first solution: I need to index all the content of my websites, and I just want Tika to ignore <meta name="id"> because I already have an id. I'll try it Monday and tell you if it works.

For the second solution: are you sure Tika uses the HTML tokenizer? I'll check.
parsing the raw query string?
I've just found Solr and am looking at what's involved in working with it. All the examples I've seen only ever use one-word search terms, which doesn't help me see how multiple-word queries work. It also looks like a lot of processing needs to be done on the raw query string even before you can pass it to Solr (in PHP) - is everyone processing the query string first and building a custom call to Solr, or is there a query-string parser I've missed somewhere? I can't even find what operators (if any) can be used in the raw query string in the online docs (maybe there aren't any??).

Any help or pointers in the right direction would be appreciated.
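For reference (general Lucene query syntax, not something stated in this thread): Solr's standard query parser accepts raw multi-word queries with operators such as AND, OR, NOT, +, -, quoted phrases, and ^ boosts, so the main client-side chore is escaping Lucene's special characters in user input rather than building the query by hand. A minimal sketch of such an escaper (PHP client libraries typically do the equivalent):

```python
# Characters special to the Lucene query parser; && and || are
# two-character operators, escaping each & and | individually is a
# simplification that still neutralizes them.
LUCENE_SPECIAL = set('+-!(){}[]^"~*?:\\&|')

def escape_query(text):
    """Backslash-escape Lucene query-syntax special characters so a
    raw user string can be passed as literal terms; spaces (term
    separators) are left alone."""
    return ''.join('\\' + ch if ch in LUCENE_SPECIAL else ch
                   for ch in text)

print(escape_query('title:solr (1.4)'))  # title\:solr \(1.4\)
```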