Re: Solr like for autocomplete field?
I implemented the edge ngrams solution and it's a great one compared to any other I could think of, because I can index more than just text (other metadata) that can be used to *rank* the autocomplete results, eventually ranking by the probability of selection, which is, after all, what you want to maximize with such a system.

On Tue, Nov 2, 2010 at 6:30 PM, Lance Norskog goks...@gmail.com wrote: And the SpellCheckComponent. There's nothing to help you with phrases.

On Tue, Nov 2, 2010 at 11:21 AM, Erick Erickson erickerick...@gmail.com wrote: Also, you might want to consider TermsComponent, see: http://wiki.apache.org/solr/TermsComponent Also, note that there's an autosuggest component that's recently been committed. Best, Erick

On Tue, Nov 2, 2010 at 1:56 PM, PeterKerk vettepa...@hotmail.com wrote: I have a city field. Now when a user starts typing in a city textbox, I want to return found matches (like Google). So for example, the user types "new", and I will return "new york", "new hampshire", etc.

My schema.xml: <field name="city" type="string" indexed="true" stored="true"/>

My current URL: http://localhost:8983/solr/db/select/?indent=on&facet=true&q=*:*&start=0&rows=25&fl=id&facet.field=city&fq=city:new

Basically 2 questions here:
1. Is the URL I'm using best practice when implementing autocomplete? What I wanted to do is use the facets for found matches.
2. How can I match PART of the city name, just like the SQL LIKE command: cityname LIKE '%userinput'?

Thanks!
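For reference, a minimal sketch of the edge-ngram setup Amit describes; the type, field, and copyField names here are illustrative, not from the thread:

  <fieldType name="text_autocomplete" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- index "new york" as "n", "ne", "new", "new ", "new y", ... -->
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="city_ac" type="text_autocomplete" indexed="true" stored="true"/>
  <copyField source="city" dest="city_ac"/>

A query like q=city_ac:new then matches "new york" and "new hampshire" directly, and extra metadata (popularity, click counts) can be folded into the ranking with boost functions.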
Re: Solr like for autocomplete field?
Have a look at ajax-solr (http://evolvingweb.github.com/ajax-solr/); the tutorial includes an example of an autocompletion widget. Tob

From: Amit Nithian anith...@gmail.com To: solr-user@lucene.apache.org Date: 03.11.2010 07:36 Subject: Re: Solr like for autocomplete field? [snip]
RE: Searching Across Multiple Cores
Sorry about the late response to this, but I was on holiday. No, as of right now there is not the same schema in each shard.

I need to be able to search a set of data resources with manually defined data fields. All of those fields are searchable. Any one of these resources can be added to an individual's favourites list, with the possibility of them adding additional tags, which are also searchable. The favourites folder needs to be searchable on all the same fields as the main data set, plus the additional user-defined tags.

Search fields for the main data schema are: resourceId, resourceType, resourceGradeLevel, resourceKeywords, resourceLength, resourceSubjectArea, and about 30 more fields. The searchable fields for the My Favourites schema are: userId, userFolder, userDefinedGradeLevel, userDefinedTags, plus all of those in the main data set.

Search queries:
1. Search the main data set for all those resources with keyword 'foo'.
2. Search the main data set for all those resources with keyword 'foo' that are for grade 3.
3. Search the main data set for all those resources with subject area 'grammar'.
4. Search my Favourites folder for all the resources I have moved there (userId = 12321) with the keyword 'foo'.
5. Search my Favourites folder for all the resources I have moved there (userId = 12321) with the keyword 'foo' that are for grade 3 and are in the folder 'testing'.
6. Search my Favourites folder for all the resources I have moved there (userId = 12321) with the subject area 'grammar' that I have tagged 'interesting'.
7. Various combinations of the above.

The simplest way I came up with to do this is to have 2 separate schemas: one for the main data set and one for My Favourites. When someone adds a resource from the main data set to their My Favourites folder, all the data from the main data set is copied over to the My Favourites schema, and the userId, folder, and other user-specific information is added as well. But there could be 1 million copies of basically the same data in My Favourites (if 1 million users add the same resource to their favourites). I thought that would waste a lot of space, so I was looking for another way to do this (using a type of join - see below). Are there any other possibilities?

Cheers, Steve

-----Original Message-----
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: 14 October 2010 18:58 To: solr-user@lucene.apache.org Subject: Re: Searching Across Multiple Cores

The point/use-case of sharding/distributed search is for performance, not for segregating different data in different places. Distributed search assumes the same schema in each shard -- do you have that? I don't think distributed search is meant to support the kind of joining you describe; that's not really what Solr does. But if you actually do have the same schema across your shards, and have distributed search set up properly -- then you don't need to do any special joining; the shards end up forming one 'logical' index. That's the point of it.

I don't think you can do what you describe. Solr doesn't do joins like an rdbms; Solr works on a single set of documents, not multiple tables or collections. If you describe your data and the kind of queries you want to run, someone might be able to figure out a way to de-normalize the data to support what you want to do. Which won't really have anything to do with shards/distributed search -- you add in distributed search for performance or giant-size-of-index purposes, but it doesn't change your schema design or queries.
Lohrenz, Steven wrote: Ken, OK, I understand how the distributed search works, but I don't understand how to build my query appropriately so that the results returned from the two shards only include values that exist in both result sets. In essence, I'm doing a join across the two shards on resourceId.

So Core0 has: resourceId (unique key), title, tag1, tag2, tag3. And Core1 has: resourceId + folder + userId + grade (concatenated - this is the uniqueId), resourceId, folder, userId, grade.

For example, I would want to find all the content with userId = 893489 and tag1 = 'contentTagX'. My thought on how to do this is to search Core1 for all the items with userId = 893489; this would return a set of results for that user with resourceId. Then I would need to search Core0 for where tag1 = 'contentTagX' and where resourceId = those returned in the result set from Core1. I can probably do this in a search handler (say a Core3 with a mashup of the 2 schemas that just redirects to the other shards), but is there an easier way to do it? Or am I missing something? Thanks for your help, Steve

-----Original Message-----
From: Ken Stanley [mailto:doh...@gmail.com] Sent: 14 October 2010 18:19 To: solr-user@lucene.apache.org Subject: Re: Searching Across Multiple Cores

Steve, Using shards is actually quite simple; it's just a matter of setting
Re: Updating last_modified field when using DIH
Juan, that's correct: Solr will not touch your database; that's part of your application code. Solr uses an updated timestamp (which is available through dataimporter.last_index_time). So, imagine the following situation: the Solr import runs every 10 minutes, with the last run at 11:00. Your entity gets updated at 11:03, so the next Solr run at 11:10 will detect it as changed and import the entity. When the import runs again at 11:20, no entity will match the delta query, because Solr will ask for a modification_date > 11:10 (the time of the last Solr run). You only need to update the last_modified field (in your application) when the entity is changed and you want Solr to (re-)index your data. HTH, Stefan

On Tue, Nov 2, 2010 at 7:35 PM, Juan Manuel Alvarez naici...@gmail.com wrote: Hello everyone! I would like to ask you a question about DIH and delta import. I am trying to sync Solr with a PostgreSQL database, and I have a field ent_lastModified of type timestamp without timezone. Here is my xml file:

  <dataConfig>
    <dataSource name="jdbc" driver="org.postgresql.Driver" url="jdbc:postgresql://host" user="XXX" password="XXX"
        readOnly="true" autoCommit="false" transactionIsolation="TRANSACTION_READ_COMMITTED" holdability="CLOSE_CURSORS_AT_COMMIT"/>
    <document>
      <entity name="myEntity" dataSource="jdbc" pk="id"
          query="SELECT * FROM Entities"
          deltaImportQuery="SELECT ent_id AS id FROM Entities WHERE ent_id=${dataimporter.delta.id}"
          deltaQuery="SELECT ent_id AS id FROM Entities WHERE ent_lastModified &gt; '${dataimporter.last_index_time}'"/>
    </document>
  </dataConfig>

Full-import works fine, but when I run a delta-import I get the corresponding records, and the ent_lastModified field stays the same, so if I run another delta-import, the same records are retrieved. I have read all the documentation at http://wiki.apache.org/solr/DataImportHandler but I could not find an update query for the last_modified field, and Solr does not seem to do this automatically. I have also tried naming the field last_modified as in the example, but its value remains unchanged after a delta-import. Can anyone point me in the right direction? Thanks in advance! Juan M.
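To make the "update it in your application" part concrete, one common pattern is a database trigger, so the timestamp can never be forgotten. A sketch, assuming PostgreSQL as in the config above; the function and trigger names are made up:

  -- Keep ent_lastModified current on every write, so the deltaQuery above picks the row up.
  CREATE OR REPLACE FUNCTION touch_last_modified() RETURNS trigger AS $$
  BEGIN
    NEW.ent_lastModified := now();  -- same column the deltaQuery compares against
    RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER entities_touch_last_modified
    BEFORE INSERT OR UPDATE ON Entities
    FOR EACH ROW EXECUTE PROCEDURE touch_last_modified();

With that in place, every write bumps ent_lastModified and the next delta-import finds the row.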
RE: Updating last_modified field when using DIH
Also, your deltaImportQuery should be: deltaImportQuery="SELECT * FROM Entities WHERE ent_id=${dataimporter.delta.id}" Otherwise you're just importing the ids and not the rest of the data. If performance is important to you, you might also want to check out http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201008.mbox/%3c9f8b39cb3b7c6d4594293ea29ccf438b01702...@icq-mail.icq.il.office.aol.com%3E

Ephraim Ofir

-----Original Message-----
From: Stefan Matheis [mailto:matheis.ste...@googlemail.com] Sent: Wednesday, November 03, 2010 12:58 PM To: solr-user@lucene.apache.org Subject: Re: Updating last_modified field when using DIH [snip]
Re: Query question
My impression was that city:Chicago^10 +Romantic +View would do what you want (with the standard Lucene query parser and default operator OR), and I'm not sure about this, but I have a feeling that the version with Boolean operators AND/OR and parens might actually net out to the same thing, since under the hood all the terms have to be translated into optional, required, or forbidden: Lucene doesn't actually have true binary boolean operators. At least that was the impression I got after some discussion at a recent conference. I may have misunderstood - if so, could someone who knows set me straight?

Yes, you are completely right. If the default operator is set to OR, your query would do the trick. And it is better to use, and think in terms of, the unary operators (+/-).
Re: Query question
Unfortunately the default operator is set to AND and I can't change that at this time. If I do (city:Chicago^10 OR Romantic OR View) it returns way too many unwanted results. If I do (city:Chicago^10 OR (Romantic AND View)) it returns fewer unwanted results, but still a lot. iorixxx's solution of (Romantic AND View AND (city:Chicago^10 OR (*:* -city:Chicago))) does seem to work: Chicago results are at the top, and the remaining results seem to fit the other search parameters. It's an ugly query, but it does the trick for now until I master dismax. Thanks all!
Core status uptime and startTime
As far as I know, in the core admin page you can find out the last time the index was modified and committed by checking lastModified. But what do startTime and uptime mean? Thanks in advance
Re: Query question
Another alternative (prettier to my eye) would be: (city:Chicago AND Romantic AND View)^10 OR (Romantic AND View) -Mike

On 11/03/2010 09:28 AM, kenf_nc wrote: [snip]
Corename after Swap in MultiCore
Hi everyone, Long question, but please bear with me. I'm using a multicore Solr instance to index different documents from different sources (around 4), and I'm using a common config for all the cores. For each source I have a core and a temp core, e.g. 'doc' and 'doc-temp'. Every time I want to get new data, I run a dataimport into the temp core and then swap the cores. For swapping I'm using the postCommit event listener, to make sure the swap is done after the commit completes. After the first swap, when I use solr.core.name on doc-temp, it returns doc as its name (because the commit is done on doc's data dir after the first swap). How do I get the core name of doc-temp here in order to swap again with .swap? I'm stuck here. Please help me. Also, does anyone know for sure whether, if a dataimport is running on a core, the next swap request will be executed only after the dataimport is finished? Thanks in advance. Ram.
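For reference, the CoreAdmin calls involved look like this (assuming the default adminPath of /admin/cores); STATUS is one way to check which registered core name currently points at which dataDir before issuing the next SWAP:

  http://localhost:8983/solr/admin/cores?action=SWAP&core=doc&other=doc-temp
  http://localhost:8983/solr/admin/cores?action=STATUS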
RE: Query question
Another option is to override the default operator in the query: {!lucene q.op=OR}city:Chicago^10 +Romantic +View

Colin.

-----Original Message-----
From: Mike Sokolov [mailto:soko...@ifactory.com] Sent: Wednesday, November 03, 2010 9:42 AM To: solr-user@lucene.apache.org Cc: kenf_nc Subject: Re: Query question [snip]
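As a request URL that would look something like the following; the local-params prefix has to come first in the q value, and the whole value must be URL-encoded:

  http://localhost:8983/solr/select?q=%7B!lucene%20q.op%3DOR%7Dcity:Chicago%5E10%20%2BRomantic%20%2BView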
Re: Influencing scores on values in multiValue fields
Be careful of multi-term queries and String types. By multi-term here, I mean multi-term according to the 'pre-tokenization' that the dismax and standard parsers do -- basically on whitespace. If you have a string with whitespace as a single value in a non-tokenized Solr String field, and you have a q that is that identical string (with whitespace, but NOT enclosed in phrase quotes) -- it still won't match, because of the pre-tokenization-on-whitespace that the query parsers do. It WILL still match if you put the q in double quotes as a phrase. And it WILL still match for a dismax pf phrase boost. But it will not match a dismax qf field, or a standard-query-parser fielded q search. This makes this approach to solving the problem not always do what you'd like. I haven't figured out a better one though. With dismax, if you include it both as a boosted field in qf (which will match on single-term queries, but not on queries with whitespace) AND as a boosted field in pf (which will match on queries with whitespace, but won't be used at all for queries without whitespace, as dismax doesn't even bring pf into play unless the pre-tokenization comes up with more than one term) -- it seems to mostly do what you'd want. An alternate strategy might be to use it as a dismax bq query, since you can tell bq to use an alternate query parser (for example !field or !raw) that won't do the pre-tokenization.

Imran wrote: Thanks Mike for your suggestion. It did take me down the correct route. I basically created another multiValued field of type 'string' and boosted that. To get the partial matches to avoid the length normalisation, I set the 'text'-type multiValued field to omitNorms. The results look as expected so far with this configuration. Cheers -- Imran

On Fri, Oct 29, 2010 at 1:09 PM, Michael Sokolov soko...@ifactory.com wrote: How about creating another field for doing exact matches (a string), searching both, and boosting the string match? -Mike

-----Original Message-----
From: Imran [mailto:imranboho...@gmail.com] Sent: Friday, October 29, 2010 6:25 AM To: solr-user@lucene.apache.org Subject: Influencing scores on values in multiValue fields

Hi All, We've got an index in which we have a multiValued field per document. Assume the multiValued field values in each document to be:
Doc1: bar lifters
Doc2: truck tires back drops bar lifters
Doc3: iron bar lifters
Doc4: brass bar lifters iron bar lifters tire something truck something oil gas

Now when we search for 'bar lifters', the expectation (based on the requirements) is that we get results in the order Doc1, Doc2, Doc4, Doc3:
Doc1 - since there's an exact match (and only one) for the search terms.
Doc2 - since there's an exact match amongst the values.
Doc4 - since there's a partial match on the values, and the number of matches is greater than in Doc3.
Doc3 - since there's a partial match.
However, the results come out as Doc1, Doc3, Doc2, Doc4. Looking at the explanation of the results, it appears Doc2 is losing to Doc3, and Doc4 is losing to Doc3, based on length normalisation. We think we can see the reason for that: the field length in Doc2 is greater than in Doc3, and Doc4's is greater than Doc3's. However, is there any mechanism by which I can force Doc2 to beat Doc3 and Doc4 to beat Doc3 with this structure? We did look at using omitNorms=true, but that messes up the scores for all docs.
The result comes out as Doc4, Doc1, Doc2, Doc3 (where Doc1, Doc2 and Doc3 get the same score). This is because the fieldNorm is no longer taken into account (as expected), with term frequency being the only contributing factor. So trying to avoid length normalisation through omitNorms is not helping. Is there any way we can have an exact match on a value in a multiValued field add to the overall score while keeping the length normalisation? Hope that makes sense. Cheers -- Imran
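For later readers, a sketch of the two-field setup this thread converges on (field names illustrative): a tokenized field with norms off for partial matches, plus an untokenized string copy whose exact-value hits can be boosted, e.g. via dismax pf as Jonathan describes above:

  <field name="labels" type="text" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
  <field name="labels_exact" type="string" indexed="true" stored="false" multiValued="true"/>
  <copyField source="labels" dest="labels_exact"/>

and in the dismax handler, something like qf=labels with pf=labels_exact^10.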
Re: Searching Across Multiple Cores
Basically, Solr doesn't do that. It seems to be a frequent topic on the listserv, people wanting Solr to be able to do something like that. But, as far as I know, it doesn't -- and I don't have a good idea of alternate ways to solve that kind of problem either. The general answer is: try to put everything in the same core. Solr shard distribution is designed for performance scaling, not for accomplishing join-like behavior across two different schemas; the distribution/shard thing isn't going to get you there.

Lohrenz, Steven wrote: [snip]
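For the archives, one hedged sketch of the "put everything in one core" de-normalization for the favourites problem above, assuming per-user data can live on the main resource document: a multiValued favouritedBy field, plus per-user folders/tags via dynamic fields such as <dynamicField name="tags_*" type="text" indexed="true" stored="true" multiValued="true"/>. All names here are illustrative:

  <doc>
    <field name="resourceId">R100</field>
    <field name="resourceKeywords">foo</field>
    <field name="resourceGradeLevel">3</field>
    <field name="favouritedBy">12321</field>
    <field name="folder_12321">testing</field>
    <field name="tags_12321">interesting</field>
  </doc>

Query 5 then becomes a plain conjunction, e.g. q=resourceKeywords:foo AND resourceGradeLevel:3 AND favouritedBy:12321 AND folder_12321:testing. The trade-off is that every favourite/tag change means reindexing the whole resource document.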
Re: Possible memory leaks with frequent replication
I hadn't looked at the code, am not familiar with the Solr code, and can't say what that code does. But I have experienced issues that I _believe_ were caused by too-frequent commits causing overlapping searcher preparation. And I've definitely seen Solr documentation that suggests this is an issue. Let me find it now to see if the experts think these documented suggestions are still correct or not:

"On the other hand, autowarming (populating) a new collection could take a lot of time, especially since it uses only one thread and one CPU. If your settings fire off snapinstaller too frequently, then a Solr slave could be in the undesirable condition of handing-off queries to one (old) collection, and, while warming a new collection, a second “new” one could be snapped and begin warming! If we attempted to solve such a situation, we would have to invalidate the first “new” collection in order to use the second one, then when a “third” new collection would be snapped and warmed, we would have to invalidate the “second” new collection, and so on ad infinitum. A completely warmed collection would never make it to full term before it was aborted. This can be prevented with a properly tuned configuration so new collections do not get installed too rapidly." http://wiki.apache.org/solr/SolrPerformanceFactors#Updates_and_Commit_Frequency_Tradeoffs

I think I've seen the same advice on another wiki page, not specifically about replication, but just about commit frequency balanced against auto-warming: overlapping warming leading to spiraling RAM/CPU usage -- but NOT an exception being thrown or HTTP error delivered. I can't find it on the wiki, but here's a listserv post where someone reports findings that match my understanding: http://osdir.com/ml/solr-user.lucene.apache.org/2010-09/msg00528.html

How does this advice square with the code Lance found? Is my understanding of how frequent commits can interact with the time it takes to warm a new collection correct? Appreciate any additional info.

Lance Norskog wrote: Isn't that what this code does?

  onDeckSearchers++;
  if (onDeckSearchers < 1) {
    // should never happen... just a sanity check
    log.error(logid+"ERROR!!! onDeckSearchers is "+onDeckSearchers);
    onDeckSearchers=1;  // reset
  } else if (onDeckSearchers > maxWarmingSearchers) {
    onDeckSearchers--;
    String msg="Error opening new searcher. exceeded limit of maxWarmingSearchers="+maxWarmingSearchers+", try again later.";
    log.warn(logid+""+msg);
    // HTTP 503==service unavailable, or 409==Conflict
    throw new SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE,msg,true);
  } else if (onDeckSearchers > 1) {
    log.info(logid+"PERFORMANCE WARNING: Overlapping onDeckSearchers=" + onDeckSearchers);
  }

On Tue, Nov 2, 2010 at 10:02 AM, Jonathan Rochkind rochk...@jhu.edu wrote: It's definitely a known 'issue' that you can't replicate (or do any other kind of index change, including a commit) at a faster frequency than your warming queries take to complete, or you'll wind up with something like you've seen. It's in some documentation somewhere I saw, for sure. The advice to 'just query against the master' is kind of odd, because then... why have a slave at all, if you aren't going to query against it? I guess just for backup purposes. But even with just one Solr, or querying the master, if you commit at a rate such that commits come before the warming queries can complete, you're going to have the same issue.
The only answer I know of is: don't commit (or replicate) at a faster rate than it takes your warming to complete. You can reduce your warming queries/operations, or reduce your commit/replicate frequency. It would be interesting/useful if Solr noticed this going on and gave you some kind of error in the log (or even an exception when started with a certain parameter for testing): "Overlapping warming queries, you're committing too fast" or something. Because it's easy to make this happen without realizing it, and then your Solr does what Simon says: runs out of RAM and/or uses a whole lot of CPU and disk IO.

Lance Norskog wrote: You should query against the indexer. I'm impressed that you got 5s replication to work reliably.

On Mon, Nov 1, 2010 at 4:27 PM, Simon Wistow si...@thegestalt.org wrote: We've been trying to get a setup in which a slave replicates from a master every few seconds (ideally every second, but currently we have it set at every 5s). Everything seems to work fine until, periodically, the slave just stops responding, from what looks like it running out of memory: org.apache.catalina.core.StandardWrapperValve invoke SEVERE: Servlet.service() for servlet jsp threw exception java.lang.OutOfMemoryError: Java heap space (our monitoring seems to confirm this). Looking around, my suspicion is that it takes new Readers longer to warm than
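For reference, the knobs involved live in solrconfig.xml; a sketch with illustrative values - tune autowarmCount down, or commit/replicate less often, until warming finishes well within the commit interval:

  <!-- refuse to pile up more than N warming searchers -->
  <maxWarmingSearchers>2</maxWarmingSearchers>
  <!-- smaller autowarmCount = faster warming = safer at high commit/replicate rates -->
  <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="64"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>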
Re: Possible memory leaks with frequent replication
Ah, but reading Peter's email message I referenced more carefully, it seems that Solr already DOES provide an info-level log warning you about overlapping warming; awesome. (But again, I'm pretty sure it does NOT throw an exception or return an HTTP error in that condition, based on my and others' experience.) "To check if your Solr environment is suffering from this, turn on INFO level logging, and look for: 'PERFORMANCE WARNING: Overlapping onDeckSearchers=x'." Sweet, good to know, and I'll definitely add this to my debugging toolbox. Peter's listserv message really ought to be a wiki page, I think. Any reason for me not to just add it as a new one with a title like "Commit frequency and auto-warming"? Unless it's already in the wiki somewhere I haven't found, assuming the wiki will let an ordinary user-created account add a new page. // Jonathan

Jonathan Rochkind wrote: [snip]
Re: A bug in ComplexPhraseQuery ?
iorixxx wrote: I added this change to SOLR-1604, can you test it and give us feedback?

Hi, Sorry for the delay. We have tested the change and it is OK for this. However, we have found that this query crashes when using ComplexPhraseQuery: "sulfur-reducing bacteria". It is due to the dash inside the phrase. Here is the trace:

  java.lang.IllegalArgumentException: Unknown query type "org.apache.lucene.search.PhraseQuery" found in phrase query string "sulfur-reducing bacteria"
    at org.apache.lucene.queryParser.ComplexPhraseQueryParser$ComplexPhraseQuery.rewrite(ComplexPhraseQueryParser.java:290)
    at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:438)
    at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:311)
    at org.apache.lucene.search.Query.weight(Query.java:98)
    at org.apache.lucene.search.Searcher.createWeight(Searcher.java:230)
    at org.apache.lucene.search.Searcher.search(Searcher.java:171)
    at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:988)
    ...

Regards, Jean-Michel
Override SynonymFilterFactory to load synonyms from alternate data source
Hi all, Can anyone comment on the ease/merit of overriding the shipped SynonymFilterFactory with a version that could load the synonyms from an alternate data source? Our application currently maintains synonyms in its database; we could export this data to 'synonyms.txt', but would prefer a db-aware implementation of SynonymFilterFactory, i.e. avoiding that middle step. From the looks of the class (private instances, static methods), it doesn't lend itself to easy subclassing. Any comments or recommendations? Thanks, Will
Re: Override SynonymFilterFactory to load synonyms from alternate data source
> Our application currently maintains synonyms in its database; we could export this data to 'synonyms.txt', but would prefer a db-aware implementation of SynonymFilterFactory [snip]

Just write your own DataBaseSynonymFilterFactory that loads the synonyms from your db using your custom logic, and then construct the SynonymFilter objects like the existing factory does [1].

[1] http://search-lucene.com/m/Av4xC1PtNLW1/
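A rough sketch of that suggestion against the Solr 1.4-era API (the table, columns, and the jdbcUrl init arg are illustrative assumptions; double-check SynonymMap's constructor and add/makeTokens signatures against your Solr version):

  import java.sql.*;
  import java.util.*;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.solr.analysis.BaseTokenFilterFactory;
  import org.apache.solr.analysis.SynonymFilter;
  import org.apache.solr.analysis.SynonymMap;

  public class DataBaseSynonymFilterFactory extends BaseTokenFilterFactory {
    private SynonymMap synMap;

    @Override
    public void init(Map<String, String> args) {
      super.init(args);
      synMap = new SynonymMap(true); // ignoreCase
      try (Connection con = DriverManager.getConnection(args.get("jdbcUrl"));
           Statement st = con.createStatement();
           ResultSet rs = st.executeQuery("SELECT term, synonym FROM synonyms")) {
        while (rs.next()) {
          // map term -> synonym, keeping the original term (includeOrig=true)
          synMap.add(Collections.singletonList(rs.getString("term")),
                     SynonymMap.makeTokens(Collections.singletonList(rs.getString("synonym"))),
                     true, true);
        }
      } catch (SQLException e) {
        throw new RuntimeException("Failed to load synonyms from database", e);
      }
    }

    public TokenStream create(TokenStream input) {
      return new SynonymFilter(input, synMap);
    }
  }

Then reference it in schema.xml in place of solr.SynonymFilterFactory, e.g. <filter class="com.example.DataBaseSynonymFilterFactory" jdbcUrl="..."/>. Note the map is loaded once at core startup, so a core reload is needed to pick up new synonyms.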
Negative or zero value for fieldNorm
Hi all, I've got a puzzling issue here. During tests I noticed a document at the bottom of the results where it should not be. I query using DisMax on the title and content fields and have a boost on title using qf. Out of 30 results, only two documents also have the term in the title. Using debugQuery and fl=*,score I quickly noticed a large negative maxScore for the complete result set, and a portion of the result set where scores sum up to zero because of a product with 0 (fieldNorm). See below for debug output for a result with score = 0:

  0.0 = (MATCH) sum of:
    0.0 = (MATCH) max of:
      0.0 = (MATCH) weight(content:kunstgrasveld in 7), product of:
        0.75658196 = queryWeight(content:kunstgrasveld), product of:
          6.6516633 = idf(docFreq=33, maxDocs=9682)
          0.113743275 = queryNorm
        0.0 = (MATCH) fieldWeight(content:kunstgrasveld in 7), product of:
          2.236068 = tf(termFreq(content:kunstgrasveld)=5)
          6.6516633 = idf(docFreq=33, maxDocs=9682)
          0.0 = fieldNorm(field=content, doc=7)
      0.0 = (MATCH) fieldWeight(title:kunstgrasveld in 7), product of:
        1.0 = tf(termFreq(title:kunstgrasveld)=1)
        8.791729 = idf(docFreq=3, maxDocs=9682)
        0.0 = fieldNorm(field=title, doc=7)

And one with a negative score:

  3.0716116E-4 = (MATCH) sum of:
    3.0716116E-4 = (MATCH) max of:
      3.0716116E-4 = (MATCH) weight(content:kunstgrasveld in 1462), product of:
        0.75658196 = queryWeight(content:kunstgrasveld), product of:
          6.6516633 = idf(docFreq=33, maxDocs=9682)
          0.113743275 = queryNorm
        4.059853E-4 = (MATCH) fieldWeight(content:kunstgrasveld in 1462), product of:
          1.0 = tf(termFreq(content:kunstgrasveld)=1)
          6.6516633 = idf(docFreq=33, maxDocs=9682)
          6.1035156E-5 = fieldNorm(field=content, doc=1462)

There are no funky issues with term analysis for the text fieldType; in fact, the term passes through unchanged. I don't omitNorms, I store termVectors, etc. Because fieldNorm = fieldBoost / sqrt(numTermsForField), I suspect my input from Nutch is messed up. A fieldNorm can never be <= 0 for a normal positive boost, and field boosts should not be zero or negative (correct me if I'm wrong). But since I can't yet figure out what field boosts Nutch sends me, I thought I'd drop by this mailing list first. There are quite a few query terms that return zero or negative scores, and many that behave as I expect. I also find it a bit hard to comprehend why the docs with negative scores rank higher in the result set than documents with zero score. Sorting defaults to score DESC, but this is perhaps another issue. Anyway, the test runs on a Solr 1.4.1 instance with Java 6 under the hood. Help or directions are appreciated =) Cheers,

-- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350
blacklist docs by uniqueKey
Hello, I have a single core servicing 3 different applications, and one of the applications doesn't want some specific docs to show up (driven by editorial decision). Over time the number of blacklisted docs could grow, so I do not want to restrict them in a query, as the query could get extremely large. Is there a configuration option where we can blacklist ids (uniqueKey) from showing up in results? Is there anything similar to the ElevationComponent that demotes docs? That could be ideal. I tried to look and see if there was a boosting option in the elevation component, so that I could negatively boost certain docs, but could not find any. Can anybody kindly point me in the right direction? Thanks, Ravi Kiran Bhaskar
Question about morelikethis and multiple fields
Hello, I'm trying to implement a "Related Articles" feature within my search application using the mlt handler. To give you a little background: my Solr index contains a single core that is created by merging 10+ other cores. Within this core is my main data item, known as an article; however, there are other data items like technical documents, tickets, etc. When a user opens an article on my web application, I want to show Related Articles based on 2 fields (title and body). I am using SolrJ as a back-end for this. The way I'm thinking of doing it is to search on the title of the existing article and hope that the first hit is that actual article. This works in most cases, but occasionally it grabs either the wrong article or a different type of data item altogether (the first hit may be a technical document, which is totally unrelated to articles). The following is my query:

  ?qt=%2Fmlt&mlt.match.include=true&mlt.mindf=1&mlt.mintf=1&mlt.fl=title,body&q=search string&fq=dataItem:article&debugQuery=true

One main thing I noticed is that this only seems to match on the body field and not the title field. I think it's doing what it's supposed to, and I'm not fully grasping the idea of mlt. So when it does the initial search to find the document against which it will find related articles, which search handlers does it use? Normally, my queries are carried out using dismax with some boosting functionality applied to them. When I use the standard query handler, however, with the qt parameter set to mlt, what happens for the initial search? Also, if anybody can suggest an alternative implementation, I would greatly appreciate it. Like I said, it's entirely possible that I don't fully understand mlt and that's causing me to implement things in a weird way. Thanks.
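One way to sidestep the "hope the first hit is the actual article" step: if the application already knows the article's uniqueKey (it usually does when the user has the article open), query the MLT handler by id so the match document is deterministic. A sketch, assuming the uniqueKey field is named id and the value is illustrative:

  /solr/mlt?q=id:ARTICLE-123&mlt.match.include=true&mlt.fl=title,body&mlt.mindf=1&mlt.mintf=1&fq=dataItem:article&rows=5

As observed later in the thread, fq constrains the related results rather than the match query, which is fine here since the id lookup already pins the match document exactly.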
Re: Negative or zero value for fieldNorm
Regarding "Negative or zero value for fieldNorm": I don't see any negative fieldNorms here... just very small positive ones? Anyway, the fieldNorm is the product of the lengthNorm and the index-time boost of the field (which is itself the product of the index-time boost on the document and the index-time boosts of all instances of that field). Index-time boosts default to 1, though, so they have no effect unless something has explicitly set a boost. -Yonik http://www.lucidimagination.com

On Wed, Nov 3, 2010 at 2:30 PM, Markus Jelsma markus.jel...@openindex.io wrote: [snip]
Re: blacklist docs by uniqueKey
How dynamic is this list? Is it feasible to add a field to your docs like blacklisteddocs, and at editorial's discretion add values to that field like app1, app2? At that point you can just filter them out via a filter query... Best, Erick

On Wed, Nov 3, 2010 at 2:40 PM, Ravi Kiran ravi.bhas...@gmail.com wrote: [snip]
Re: blacklist docs by uniqueKey
On Wed, Nov 3, 2010 at 3:05 PM, Erick Erickson erickerick...@gmail.com wrote: [snip]

Right, or a combination of the two approaches. For a realtime approach, add the newest filters (say, any filters added that day) to a filter query, and roll those into a nightly reindex. -Yonik http://www.lucidimagination.com
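Sketches of both variants (the field name and ids are illustrative). The indexed-field approach filters with:

  fq=-blacklistedapps:app1

while the realtime variant excludes the day's additions by uniqueKey until the nightly reindex folds them in:

  fq=-id:(doc123 OR doc456 OR doc789)

Pure-negative filter queries like these are legal in Solr, and fq results are cached, so the same blacklist filter is cheap to apply on every request.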
How to display the synonyms
Hi, If the synonyms.txt file defines the following: castle,fort - then I am able to match "fort" when the user searches for "castle". However, I would like to tell the user that castle is a synonym for fort, for those users who may wonder why they got different search results when they were looking for "castle". Is there a way to get that info when the search is made? Thanks.
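One low-tech way to surface this, assuming the synonym expansion happens at query time: request debugQuery=true and inspect the parsed query in the debug block, which shows the expanded terms (the exact rendering varies by field type and parser):

  /solr/select?q=castle&debugQuery=true

The parsedquery element would then contain both castle and fort for the searched field, and the application can display "castle is a synonym for fort" when it sees the expansion.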
Re: Negative or zero value for fieldNorm
> Regarding "Negative or zero value for fieldNorm": I don't see any negative fieldNorms here... just very small positive ones?

Of course, you're right. The E-# got twisted in my mind and became negative. Silly me.

> Anyway, the fieldNorm is the product of the lengthNorm and the index-time boost of the field (which is itself the product of the index-time boost on the document and the index-time boosts of all instances of that field). Index-time boosts default to 1, though, so they have no effect unless something has explicitly set a boost.

I've just checked docs 7 and 1462 (respectively the first and second in the debug output below) with Luke. The title and content fields have no index-time boosts, thus defaulting to 1.0f, which is fine. Then why does doc 7 have a fieldNorm of 0.0 on title (giving the doc a 0.0 score in the result set), and why does doc 1462 have a very, very small fieldNorm?

debugOutput for doc 7: 0.0 = fieldNorm(field=title, doc=7)
Luke on the title field of doc 7: <float name="boost">1.0</float>

Thanks for your reply!

[snip]
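Not an answer to the Nutch question, but it may explain the odd constants: Lucene 2.9 (under Solr 1.4) stores each fieldNorm as a single byte, so decoded values snap onto a coarse grid - 6.1035156E-5 is exactly 2^-14, one of the representable values - and a small enough norm (e.g. an index-time boost of 0 from the feeding application) encodes to byte 0 and decodes to exactly 0.0. A small sketch to see the grid, assuming the Lucene 2.9 Similarity API:

  import org.apache.lucene.search.Similarity;

  public class NormGrid {
    public static void main(String[] args) {
      // fieldNorm = lengthNorm * index-time boosts, then encoded into one lossy byte
      float[] probes = {0.0f, 1.0E-4f, 6.1035156E-5f, 0.5f, 1.0f};
      for (float f : probes) {
        byte b = Similarity.encodeNorm(f);
        System.out.println(f + " -> " + Similarity.decodeNorm(b));
      }
    }
  }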
Re: Question about morelikethis and multiple fields
I don't quite understand what you mean by that. Did you mean the TermVectorComponent? Also, I did some more digging and found some messages on this mailing list about filtering. From what I understand, using the standard query handler (solr/select/?q=...) with a qt parameter allows you to filter the initial response using the fq parameter. While this is not a perfect solution for my application, it would greatly reduce any errors I may get in the data. However, when I tried fq, all it did was filter the result set from the mlt handler, not the initial response. I need to filter both the initial response and the result set.
Filter by relevance
Is it possible to filter my search results by relevance? For example, anything below a certain value shouldn't be returned. I also retrieve facet counts in my search queries, so it would be useful if the facet counts also respected the filter on relevance. Thank you, Jason.
Re: blacklist docs by uniqueKey
I don't believe there is, but it occurs to me that the additional feature Tom Burton-West contemplates in the thread "filter query from external list of Solr unique IDs" could potentially address your problem too, if it existed. I think that feature could address a variety of problems; I've been thinking about it. http://apache.markmail.org/message/etqwbv6piikaqgo5 Ravi Kiran wrote: Hello, I have a single core servicing 3 different applications, and one of the applications doesn't want some specific docs to show up (driven by editorial decision). Over time the number of blacklisted docs could grow, hence I do not want to restrict them in a query, as the query could get extremely large. Is there a configuration option where we can blacklist ids (uniqueKey) from showing up in results? Is there anything similar to the ElevationComponent that demotes docs? That could be ideal. I tried to look and see if there was a boosting option in the elevation component so that I could negatively boost certain docs, but could not find any. Can anybody kindly point me in the right direction? Thanks Ravi Kiran Bhaskar
phrase boost on dismax query
I have 3 fields in my index that I use in a dismax query with boosts and phrase boosts. I've realised that I'm not really interested in one of the fields at all, unless the search term appears in that field as a phrase. Is it realistic to set the normal boost to zero for this field, but the phrase boost to something much higher, in order to achieve the desired effect? Thank You
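Two details worth knowing when sketching this with dismax: pf is boost-only (it never causes additional documents to match), and I wouldn't rely on a literal ^0 boost in qf doing the right thing. So a close approximation is a tiny — but nonzero — qf boost plus a large pf boost. A sketch with placeholder field names (title, description, extra_text):

...&defType=dismax&q=user+query&qf=title^2.0+description+extra_text^0.01&pf=extra_text^10

Documents that match only in extra_text still come back, but contribute almost nothing to the score unless the whole phrase matches and picks up the pf boost.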
Re: blacklist docs by uniqueKey
How does the exclude=true option in elevate.xml perform with a large number of excludes? Then you could have a separate elevate config for that client. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com On 3. nov. 2010, at 20.11, Yonik Seeley wrote: On Wed, Nov 3, 2010 at 3:05 PM, Erick Erickson erickerick...@gmail.com wrote: How dynamic is this list? Is it feasible to add a field to your docs, like blacklisteddocs, and at editorial's discretion add values to that field like app1, app2? At that point you can just filter them out via a filter query... Right, or a combination of the two approaches. For a realtime approach, add the newest filters (say, any filters added that day) to a filter query, and roll those into a nightly reindex. -Yonik http://www.lucidimagination.com Best Erick
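Spelling out Erick's suggestion above as request parameters (the field and app names are his placeholders; the doc IDs in the second filter are made up): a negated filter query per application hides its blacklisted docs, and Yonik's realtime variant tacks same-day IDs onto a second filter until the nightly reindex absorbs them.

&fq=-blacklisteddocs:app1
&fq=-id:(doc100 OR doc101)

Each fq is cached independently in the filterCache, so the mostly-static blacklist filter stays cheap across requests.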
Re: Filter by relevance
Is it possible to filter my search results by relevance? For example, anything below a certain value shouldn't be returned? http://search-lucene.com/m/4AHNF17wIJW1/
RE: blacklist docs by uniqueKey
A filter that could accept a list of SOLR document IDs as articulated by Tom Burton-West would enable some important features for our application. So if anyone is wondering if this would be a useful feature, consider this a yes vote.
Re: Possible memory leaks with frequent replication
Do you use EmbeddedSolr in the query server? There is a memory leak that shows up when taking a lot of replications. On Wed, Nov 3, 2010 at 8:28 AM, Jonathan Rochkind rochk...@jhu.edu wrote: Ah, but reading Peter's email message I reference more carefully, it seems that Solr already DOES provide an info-level log warning you about overlapping warming, awesome. (But again, I'm pretty sure it does NOT throw or return an HTTP error in that condition, based on my and others' experience.) To check if your Solr environment is suffering from this, turn on INFO-level logging and look for: 'PERFORMANCE WARNING: Overlapping onDeckSearchers=x'. Sweet, good to know, and I'll definitely add this to my debugging toolbox. Peter's listserv message really ought to be a wiki page, I think. Any reason for me not to just add it as a new one with a title like Commit frequency and auto-warming? Unless it's already in the wiki somewhere I haven't found, and assuming the wiki will let an ordinary user-created account add a new page. // Jonathan Rochkind wrote: I hadn't looked at the code, am not familiar with the Solr code, and can't say what that code does. But I have experienced issues that I _believe_ were caused by too-frequent commits causing overlapping searcher preparation. And I've definitely seen Solr documentation that suggests this is an issue. Let me find it now to see if the experts think these documented suggestions are still correct or not: On the other hand, autowarming (populating) a new collection could take a lot of time, especially since it uses only one thread and one CPU. If your settings fire off snapinstaller too frequently, then a Solr slave could be in the undesirable condition of handing off queries to one (old) collection and, while warming a new collection, a second “new” one could be snapped and begin warming! If we attempted to solve such a situation, we would have to invalidate the first “new” collection in order to use the second one; then when a “third” new collection was snapped and warmed, we would have to invalidate the “second” new collection, and so on ad infinitum. A completely warmed collection would never make it to full term before it was aborted. This can be prevented with a properly tuned configuration so new collections do not get installed too rapidly. http://wiki.apache.org/solr/SolrPerformanceFactors#Updates_and_Commit_Frequency_Tradeoffs I think I've seen that same advice on another wiki page, not specifically about replication but just about commit frequency balanced against auto-warming, leading to overlapping warming, leading to spiraling RAM/CPU usage -- but NOT an exception being thrown or an HTTP error delivered. I can't find it on the wiki, but here's a listserv post with someone reporting findings that match my understanding: http://osdir.com/ml/solr-user.lucene.apache.org/2010-09/msg00528.html How does this advice square with the code Lance found? Is my understanding of how frequent commits can interact with the time it takes to warm a new collection correct? Appreciate any additional info. Lance Norskog wrote: Isn't that what this code does?

onDeckSearchers++;
if (onDeckSearchers < 1) {
  // should never happen... just a sanity check
  log.error(logid + "ERROR!!! onDeckSearchers is " + onDeckSearchers);
  onDeckSearchers = 1;  // reset
} else if (onDeckSearchers > maxWarmingSearchers) {
  onDeckSearchers--;
  String msg = "Error opening new searcher. exceeded limit of maxWarmingSearchers="
             + maxWarmingSearchers + ", try again later.";
  log.warn(logid + msg);
  // HTTP 503==service unavailable, or 409==Conflict
  throw new SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE, msg, true);
} else if (onDeckSearchers > 1) {
  log.info(logid + "PERFORMANCE WARNING: Overlapping onDeckSearchers=" + onDeckSearchers);
}

On Tue, Nov 2, 2010 at 10:02 AM, Jonathan Rochkind rochk...@jhu.edu wrote: It's definitely a known 'issue' that you can't replicate (or make any other kind of index change, including a commit) at a faster frequency than your warming queries take to complete, or you'll wind up with something like you've seen. It's in some documentation somewhere I saw, for sure. The advice to 'just query against the master' is kind of odd, because then... why have a slave at all, if you aren't going to query against it? I guess just for backup purposes. But even with just one Solr, or querying the master, if you commit at a rate such that commits come before the warming queries can complete, you're going to have the same issue. The only answer I know of is: don't commit (or replicate) at a faster rate than it takes your warming to complete. You can reduce your warming queries/operations, or reduce your commit/replicate frequency. Would be interesting/useful if Solr noticed this going on, and gave you
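For reference, the knobs this thread keeps circling live in solrconfig.xml. A sketch with illustrative values (the defaults in your install may differ):

<query>
  <!-- fail fast instead of letting warming searchers pile up -->
  <maxWarmingSearchers>2</maxWarmingSearchers>
  <!-- smaller autowarmCount means faster, cheaper warming -->
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
</query>

Lowering autowarmCount (and trimming any newSearcher warming queries) shortens warm-up time, which in turn raises the commit/replication frequency you can sustain without overlapping searchers.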
Re: Filter by relevance
Be aware, though, that relevance isn't absolute: scores are only meaningful *within* a single query, and they aren't normalized or comparable across queries. So filtering on a fixed score value rarely does what you think it will. Limiting to the top N docs is usually more reasonable. But this may be an XY problem. What is it you're trying to accomplish? Perhaps if you state the underlying problem, some other suggestions may be in the offing. Best Erick On Wed, Nov 3, 2010 at 4:48 PM, Jason Brown jason.br...@sjp.co.uk wrote: Is it possible to filter my search results by relevance? For example, anything below a certain value shouldn't be returned? I also retrieve facet counts in my search queries, so it would be useful if the facet counts also respected the filter on the relevance. Thank You. Jason.
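That said, if a hard cutoff really is wanted, one trick to try is filtering on the score of the main query with the frange parser — a sketch, assuming your version has both frange (shipped with 1.4) and the query() function (if not, this needs a newer build); the threshold 0.25 is arbitrary and, per the caveat above, not comparable across queries:

...&q=ipod&fq={!frange l=0.25}query($q)

Since facet counts are computed over the documents that survive all filters, this approach would also make the facets respect the cutoff, as the original poster asked.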
ZendCon 2010 - Slides on Building Intelligent Search Applications with Apache Solr and PHP 5
Due to popular demand, the link to my slides @ ZendCon is now available here, in case anyone else is looking for it. http://slidesha.re/bAXNF3 The sample code will be uploaded shortly. Feedback is also appreciated: http://joind.in/2261 -- °O° Good Enough is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once. http://www.israelekpo.com/
Re: blacklist docs by uniqueKey
Mr. Rochkind pointed out the exact requirement I had in mind, i.e. a filter query from an external list of Solr unique IDs. On the flip side, even filter queries can be dicey for me, as I could very easily blow past the 1024-byte URL GET limit; my original queries are themselves very long... just adding 100 or 200 IDs to exclude could cause trouble. This is exactly why I am trying to find a configuration option as opposed to writing filter queries. Thank you all for actively helping me out. Ravi Kiran Bhaskar Principal Software Engineer Washington Post 1150 15th Street NW, Washington, DC 20071
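One workaround for the GET length ceiling, for what it's worth: Solr's /select accepts the same parameters as a form-encoded HTTP POST, so a long exclusion filter can ride in the request body instead of the URL. A sketch with made-up IDs and a placeholder core path:

curl http://localhost:8983/solr/select \
  --data-urlencode 'q=your existing long query' \
  --data-urlencode 'fq=-id:(doc100 OR doc101 OR doc102)'

The practical limit then becomes maxBooleanClauses (1024 clauses by default) rather than URL length, and that one is configurable in solrconfig.xml.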
Re: blacklist docs by uniqueKey
Yes, I also saw exclude=true in an example elevate.xml... I was wondering what it does precisely, and whether the text attribute MUST have a value? I couldn't find any documentation explaining it:

<query text="ipod">
  <doc id="MA147LL/A" />            <!-- put the actual ipod at the top -->
  <doc id="IW-02" exclude="true" /> <!-- exclude this cable -->
</query>

Ravi Kiran Bhaskar Principal Software Engineer Washington Post 1150 15th Street NW, Washington, DC 20071
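To the question above: exclude=true removes the listed document from results whenever the incoming query matches the text attribute — so yes, text must carry the trigger query. For reference, the stock example solrconfig.xml wires the component up roughly like this (taken from the 1.4 example config; double-check against your own install):

<searchComponent name="elevator" class="solr.QueryElevationComponent">
  <!-- fieldType used to analyze incoming query text -->
  <str name="queryFieldType">string</str>
  <str name="config-file">elevate.xml</str>
</searchComponent>

<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
  <arr name="last-components">
    <str>elevator</str>
  </arr>
</requestHandler>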
Re: A bug in ComplexPhraseQuery ?
However, we have found that this query crashes when using ComplexPhraseQuery: "sulfur-reducing bacteria". It is due to the dash inside the phrase — presumably the analyzer splits sulfur-reducing into two tokens, leaving an embedded PhraseQuery that the complex phrase parser doesn't know how to rewrite. Here is the trace: java.lang.IllegalArgumentException: Unknown query type org.apache.lucene.search.PhraseQuery found in phrase query string "sulfur-reducing bacteria" I added Terje Eggestad's fix [1]; can you test it and give us feedback? [1] https://issues.apache.org/jira/browse/LUCENE-1486?focusedCommentId=12900278&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12900278
Does Solr support Natural Language Search
Does Solr support Natural Language Search? I did not find anything about this in the reference manual. Please let me know. Thanks.
Problem escaping question marks
I'm having difficulty properly escaping ? in my search queries. It seems as though it matches any character. Some info, a simplified schema, and a query to explain the issue I'm having. I'm currently running Solr 1.4.1. Schema:

<field name="id" type="sint" indexed="true" stored="true" required="true" />
<field name="first_name" type="string" indexed="true" stored="true" required="false" />

I want to return any first name with a question mark in it. Query:

first_name:*\?*

It returns all documents with any character in it. Can anyone lend a hand? Thanks! Stephen
Re: replication not working between 1.4.1 and 3.1-dev
On 10/29/2010 4:33 PM, Shawn Heisey wrote: The recommended method of safely upgrading Solr that I've read about is to upgrade slave servers, keeping your production application pointed either at another set of slave servers or your master servers. Then you test it with a dev copy of your application, and once you're sure it's working, you can switch production traffic over to the upgraded set. If it falls over, you just switch back to the old version. Once you're sure it's TRULY working, you upgrade everything else. To convert fully to the new index format, you have the option of reindexing or optimizing your existing indexes. I like this method, and this is the way I want to do it, except that the new javabin format makes it impossible. I need a viable way to replicate indexes from a set of 1.4.1 master servers to 3.1-dev slaves. Delving into the source and tackling the problem myself is something I would truly love to do, but I lack the necessary skills. Since I don't have the java skills required to solve the underlying problem, I have come up with a solution in the realm that I do understand - my build scripts. I will update the scripts so that they can safely work on the slave machines as well as the masters. They are currently hard-coded to work on the masters. By turning replication off and running the scripts against both server sets, I'll be able to do all my testing. IMHO this incompatibility with replication is a bug that needs to be fixed before the official release, which is why I filed SOLR-2204. I have found a way around it, but the workaround might not be a viable option for everyone. Thanks, Shawn