Re: Explicitly tell Solr the analyzed value when indexing a document
I have a couple of string fields. For some of them I want my application to be able to index a lowercased string but store the original value. Is there some way to do this? Or would I have to come up with a new field type and implement an analyzer? If you have stored="true" in your field definition, Solr always stores the original value. The response returns that original, stored value. Search and faceting are done against the indexed values. Therefore you don't need to do anything special in your case.
Re: Explicitly tell Solr the analyzed value when indexing a document
I want faceting on a string field to return monkey while the original value is *Monkey. So I want the indexed value to be lowercase and the stored value to be the original. That is, I want to do the analyzing in my application and tell Solr what to use as the indexed value and what to use as the stored value. /Tim
Solr Near Real-Time Search, Soft Commit Problem
Hi, I was trying to configure a Solr instance with near real-time search and auto-complete capabilities. I got stuck on the NRT feature. There are 15 new records per second being inserted into the database (MySQL), and I index them with DIH. First, I tried to manage commits from solrconfig.xml with the configuration below.

<autoCommit>
  <maxDocs>1</maxDocs>
  <maxTime>10</maxTime>
</autoCommit>
<autoSoftCommit>
  <maxDocs>15</maxDocs>
  <maxTime>1000</maxTime>
</autoSoftCommit>

And the bash script below is responsible for pulling deltas without committing.

while [ 1 ]; do
    wget -O /dev/null 'http://localhost:8080/solr-jak/dataimport?command=delta-import&commit=false' 2>/dev/null
    sleep 1
done

Then I ran my query from the browser:

http://localhost:8080/solr-jak/select?q=movie_name_prefix_full:dogville&defType=lucene&q.op=OR

But I realized that with this configuration the index files change every second, and after a minute there are only 600 new records in the Solr index while there are 900 new records in the database. After experiencing that, I removed the autoCommit and autoSoftCommit elements from solrconfig.xml and updated my bash script as follows. But the index files still change, and Solr cannot stay synchronized with the database.

while [ 1 ]; do
    echo "Soft commit applied!"
    wget -O /dev/null 'http://localhost:8080/solr-jak/dataimport?command=delta-import&commit=false' 2>/dev/null
    curl http://localhost:8080/solr-jak/update -H 'Content-Type: text/xml' --data-binary '<commit softCommit="true" waitFlush="false" waitSearcher="false"/>' 2>/dev/null
    sleep 3
done

Even when I decreased the pressure on Solr to 1 new record per second and soft commits every 6 seconds, there was still a gap between the index and the DB. Is there anything that I missed? I took a look at /get too, but it works only by pk. If there is an example configuration (like 1 sec for soft commit and 10 min for hard commit) as a best practice, that would be great. Finally, here is my configuration: Ubuntu 11.04, JDK 1.6.0_27, Tomcat 7.0.21, Solr 4.0 2011-10-24_08-53-02. All advice is appreciated. Best Regards, Jak
Re: Explicitly tell Solr the analyzed value when indexing a document
I think I should be able to do what I want with solr.PatternReplaceCharFilterFactory. /Tim
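For reference, a minimal sketch of what such a field type could look like, assuming the goal is to strip the leading asterisk from values like *Monkey and lowercase the rest while the stored value stays untouched (the type name and pattern here are illustrative, not from the original thread):

<fieldType name="facet_string" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <!-- strip a leading '*' before tokenization (illustrative pattern) -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="^\*" replacement=""/>
    <!-- keep the whole value as a single token, like a string field -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With stored="true" on the field, the response would still return *Monkey while faceting returns monkey.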
Re: Explicitly tell Solr the analyzed value when indexing a document
I want faceting on a string field to return monkey while the original value is *Monkey. So I want the indexed value to be lowercase and the stored value to be the original. That is, I want to do the analyzing in my application and tell Solr what to use as the indexed value and what to use as the stored value. Sorry, but I don't follow. Why don't you just use the lowercase filter?
Re: Highlighting apostrophe
Hi, have you found the solution to your highlighting apostrophe problem?
Re: delta-import of rich documents like word and pdf files!
Thank you for your replies, guys; that helped a lot. Thanks iorixxx, that was the command that worked. I also tried my Solr with MySQL and that worked too. Congo! :) Now I want to index my files according to their size and facet them according to their size ranges. I know that there is a fileSize option in FileListEntityProcessor, but I am not finding any way to make this work. Is fileSize a metadata field? If it is, then the steps I performed are: I created a field and a dynamic field in the schema:

<dynamicField name="metadata_*" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="fileSize" type="string" indexed="true" stored="true" required="false"/>

added the range facet in solrconfig.xml:

<int name="f.fileSize.facet.range.start">0</int>
<int name="f.fileSize.facet.range.gap">100</int>
<int name="f.fileSize.facet.range.end">600</int>

and added the field in data-config.xml:

<field column="FileSize" name="fileSize"/>

But that did not work out! Am I missing something? Please help me out. Thanks in advance.
Re: to prevent number-of-matching-terms in contributing score
On Thu, Nov 17, 2011 at 6:59 AM, Chris Hostetter hossman_luc...@fucit.org wrote:
: 1. omitTermFreqAndPositions is very straightforward but if I avoid
: positions I'll refuse to serve phrase queries.
but do you really need phrase queries on your cat field? i thought the point was to have simple matching on those terms?

Yes, I need to match phrases. Consider the following documents:
Doc1 - categories: teak wooden chair, bamboo wooden chair
Doc2 - categories: wooden chair
Doc3 - categories: plastic chair, wooden cupboard
A query "wooden chair" should give doc1 and doc2 with equal score (provided other fields generate the same score), and doc3 should be excluded. A non-phrase match would include doc3 as well.

: 2. Function query seemed nice (though strange because I never used it
: before) and I gave it a few hours but that too did not seem to solve my
: requirement. The artificial score we are generating is getting multiplied
: into the rest of the score, which includes the score due to the cat field as well. (I
: can not remove cat from qf as I have to search there). It is only that
: I don't want this field's score to be based on matching tf.
I don't think i realized you were using dismax ... if you just want a match on cat to help determine if the document is a match, but not have *any* impact on score, you could just set the qf boost to 0 (ie: qf=title^10 cat^0) but i'm not sure if that's really what you want.

Well, this is almost what I want. (Thanks for telling me about ^0; I learned a new thing.) I wanted a constant score for a match in cat, and I did not want the frequency of matches in cat to affect the score, which can be done this way. But I definitely want to generate some score, equal to a single match (tf = 1), so that less important fields like description do not get a higher boost than cat. Writing ^0 creates a 0.0 score for a match in cat, while a match in description will generate some positive score greater than zero.

: After spending some hours on function queries I finally reached the
: following query
Honestly: i'm not really following what you tried there because of the formatting applied by your email client ... it seemed to be making tons of hyperlinks out of pieces of the URL. Looking at your query explanation however the problem seems to be that you are still using the relevancy score of the matches on the cat field, instead of *just* using the function boost...

I did try *just* using the function boost, i.e. removed cat from qf, but it did not seem to return documents whose matching categories are only in the cat field. The query was something like the following (I hope it is clear this time):

url?q={!boost b=$cat_boost v=$main_query}
main_query={!dismax qf=title v=$qry}
cat_boost={!func}map(query({!field f=cat v=$qry},-1),0,1000,5,1)
qry=chair

(note: I slightly modified the cat_boost parameter to use only a single map() function with the 5-argument form). It gave me just two docs where the title contained the query word (chair). I also tried changing main_query to

main_query={!dismax qf="title cat" v=$qry}

which gave me all 4 required docs, but with scores varying on the basis of cat as well, and

main_query={!dismax qf="title cat^0" v=$qry}

which gave me all required docs with a constant (0.0) cat score. But when I add description to qf, docs even with the worst match in description will score higher than docs with a good match in cat, which is not exactly what is required.
: But debugging the query showed that the boost value ($cat_boost) is being
: multiplied into a value which is generated with the help of the cat field,
: thus resulting in different scores for 1 and 3 (similarly for 2 and 4).
:
: 1.2942866 = (MATCH) boost(+(title:chair | cat:chair)~0.01
: (),map(query(cat:chair,def=-1.0),0.0,1000.0,1.0)), product of: ...
my point before was to take cat:chair out of the main part of your query, and *only* put it in the boost function. if you are using dismax, the qf=cat^0 suggestion mentioned above *combined* with your boost function will probably get you what you want (i think)

Taking cat:chair out of main_query (the dismax equivalent of removing cat from qf) or using cat^0 did not produce the desired effect, as I described earlier.

: I was thinking there should be some hook or plugin (or anything) which
: could just change the score calculation formula *for a particular field*.
: There is a function in the DefaultSimilarity class - public float tf(float
: freq) - but that does not mention the field name. Is there a possibility to
: look in this direction?
on trunk, there is a distinct Similarity object per fieldtype, so you could certainly look at that -- but you are correct that in 3x there is no way to override the tf() function on a per field basis.

I'll definitely look at the Similarity class. I hope there are no performance degradation issues with it :)

-Hoss

Thank you very much.
--
Regards, Samar
Re: delta-import of rich documents like word and pdf files!
Now, I want to index my files according to their size and facet them according to their size ranges. I know that there is an option of fileSize in FileListEntityProcessor but I am not getting any way to perform this. Is fileSize a metadata?

You don't need a dynamic field for this. The following additions should enable and populate fileSize. In data-config.xml:

<entity name="f" processor="FileListEntityProcessor" ...>
  <field column="fileSize" name="fileSize"/>
</entity>

In schema.xml:

<field name="fileSize" type="string" indexed="true" stored="true" required="false"/>
Re: delta-import of rich documents like word and pdf files!
Thanks for your reply, I performed exactly those steps. But still there is no response in the browse section. I edited facet_ranges.vm for this; it does not calculate the size of the documents. Can you please tell me the command to check that the response shows the size of the file? Thanks again
Re: delta-import of rich documents like word and pdf files!
And also I set my fileSize to type long. String will not work, I think! Size cannot be a string... it shows an error when using string as the type.
Re: delta-import of rich documents like word and pdf files!
I ran this command and can see the size of my files:

http://localhost:8080/solr/select?q=user&f.fileSize.facet.range.start=100

Great, thanks... string worked... I don't know why that did not work last time. But when I do that in the browse section, I see the following output in my logs:

SEVERE: Exception during facet.range of fileSize:org.apache.solr.common.SolrException: Unable to range facet on field:fileSize{type=string,properties=indexed,stored,omitNorms,omitTermFreqAndPositions,sortMissingLast}
  at org.apache.solr.request.SimpleFacets.getFacetRangeCounts(SimpleFacets.java:834)
  at org.apache.solr.request.SimpleFacets.getFacetRangeCounts(SimpleFacets.java:778)
  at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:178)
  at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:72)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
  at org.apache.solr.servl..

This does not come when I set it to type int, but when I use int it does not show the size!! Please help me out.
Solr Master High Availability
Hi, I'm looking into high-availability Solr master configurations. Does anybody have a good solution to this? The things I'm looking into are:
* Using Solr replication to keep a second backup master.
* Indexing on a separate machine(s); the problem here being that the index will be different from the other machine's, needing a full replication to all slaves in case of failure of the first master.
* Having the whole setup replicated to another machine which is then used as the master machine if the primary master fails.
Any more ideas/experiences? Toni
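For the first option, a minimal sketch of the stock ReplicationHandler configuration that could keep a warm backup master in sync (hostnames, the confFiles list, and the poll interval are placeholders, not from the original post). On the primary master, in solrconfig.xml:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

On the backup master, polling the primary like an ordinary slave:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://primary-master:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>

On failover, the backup already holds a current copy of the index, and the slaves would then be repointed at it.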
ISO8601 Date format
Hello, due to a bug in another system, we stored a date in a date field with a value like 999-12-31T23:00:00Z. As you can see in the schema browser output below, Solr stores it correctly with four digits, but in a response the leading zero is missing. My question is: is a three-digit year a valid ISO-8601 date format for the response, or is this a bug? Other languages (e.g. Python) throw an exception on a three-digit year.

Response:
<doc>
  ...
  <date name="effective">999-12-31T23:00:00Z</date>
</doc>

Schema browser:
Field: effective
Field Type: date
Properties: Indexed, Tokenized, Stored, undefined
Schema: Indexed, Tokenized, Stored, undefined
Index: Indexed, Tokenized, Stored
Index Analyzer: org.apache.solr.analysis.TokenizerChain
Query Analyzer: org.apache.solr.analysis.TokenizerChain
Docs: 86727
Distinct: 4

term / frequency
0999-12-31T23:00:00Z / 165602
2011-11-05T23:00:00Z / 3543
2011-10-19T07:22:20.908Z / 2
2011-10-12T15:40:00Z / 2

Thx and best regards, Axel
Re: Aggregated indexing of updating RSS feeds
Thanks Chris. (Bell rings.) The 'params' logging pointer was what I needed. So for reference: it's not a good idea to use a wget command directly in a crontab. I was using:

wget http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false

...but moving this into a separate shell script, wrapping the URL in quotes, and calling that resolved the issue. Thanks very much.
Re: Aggregated indexing of updating RSS feeds
Am 17.11.2011 11:53, schrieb sbarriba:
The 'params' logging pointer was what I needed. So for reference: it's not a good idea to use a wget command directly in a crontab. I was using:
wget http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false

:)) I think the shell handled the ampersand (&) as an instruction to put the wget command into the background. You could put the full URL into quotes, or escape the ampersand with a backslash. Then it should work as well. -Kuli
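To make that concrete, a quick sketch of the two variants Kuli describes (the URL and parameters are the ones from the original post):

# Quote the URL so the shell does not treat '&' as a background operator:
wget 'http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false'

# ...or escape each ampersand with a backslash:
wget http://localhost/solr/myfeed?command=full-import\&rows=5000\&clean=false

Unquoted, the shell cuts the command at the first '&', runs wget with only command=full-import in the background, and treats rows=5000 and clean=false as separate no-op variable assignments.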
Re: Problems with AutoSuggest feature(Terms Components)
TermsComponent only reacts to what you send it. How are these requests getting to the TermsComponent? That's where you should look. As far as terms.limit, your request handler for TermsComponent in solrconfig.xml has a defaults section; you can set whatever you want in there and then override it as you choose if you sometimes want other values. Best Erick

On Wed, Nov 16, 2011 at 9:17 AM, mechravi25 mechrav...@yahoo.co.in wrote:
Hi, when I search for data I notice two things:
1.) I see terms.regex=.* in the logs, which does a blank search on terms, because of which the query time is high. Is there any way to overcome this? My actual query should use the first regex shown below (terms.regex=ABC\+CCC\+lll\+data.*), but instead it sometimes goes out as the second case (terms.regex=.*).
2.) I also see terms.limit=-1, which is very expensive as it asks Solr to return all the terms. It should be set to 10 or 20 at most. Please provide some suggestions to set the same.

Nov 14, 2011 2:04:08 PM org.apache.solr.core.SolrCore execute
INFO: [db] webapp=/solr path=/terms params={terms.regex=ABC\+CCC\+lll\+data.*&terms.regex.flag=case_insensitive&terms.fl=nameFacet} status=0 QTime=935
Nov 14, 2011 2:04:08 PM org.apache.solr.core.SolrCore execute
INFO: [core2] webapp=/solr path=/terms params={terms.regex.flag=case_insensitive&shards.qt=/terms&terms.fl=nameFacet&terms=true&terms.limit=-1&terms.regex=ABC\+CCC\+lll\+data.*&isShard=true&qt=/terms&wt=javabin&terms.sort=index&version=1} status=0 QTime=842
Nov 14, 2011 2:04:08 PM org.apache.solr.core.SolrCore execute
INFO: [db] webapp=/solr path=/terms params={terms.regex=ABC\+CCC\+lll\+data.*&terms.regex.flag=case_insensitive&terms.fl=nameFacet} status=0 QTime=927
Nov 14, 2011 2:04:08 PM org.apache.solr.core.SolrCore execute
INFO: [core3] webapp=/solr path=/terms params={terms.regex.flag=case_insensitive&shards.qt=/terms&terms.fl=nameFacet&terms=true&terms.limit=-1&terms.regex=.*&isShard=true&qt=/terms&wt=javabin&terms.sort=index&version=1} status=0 QTime=115
Nov 14, 2011 2:05:55 PM org.apache.solr.core.SolrCore execute
INFO: [core1] webapp=/solr path=/terms params={terms.regex.flag=case_insensitive&shards.qt=/terms&terms.fl=nameFacet&terms=true&terms.limit=-1&terms.regex=.*&isShard=true&qt=/terms&wt=javabin&terms.sort=index&version=1} status=0 QTime=106767
Nov 14, 2011 2:05:55 PM org.apache.solr.core.SolrCore execute
INFO: [core4] webapp=/solr path=/terms params={terms.regex.flag=case_insensitive&shards.qt=/terms&terms.fl=nameFacet&terms=true&terms.limit=-1&terms.regex=.*&isShard=true&qt=/terms&wt=javabin&terms.sort=index&version=1} status=0 QTime=106766
Nov 14, 2011 2:05:55 PM org.apache.solr.core.SolrCore execute
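Following Erick's suggestion, a sketch of what the defaults section for a /terms handler could look like in solrconfig.xml (the limit of 20 is just an example value, not from the original thread):

<searchComponent name="terms" class="solr.TermsComponent"/>

<requestHandler name="/terms" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <!-- cap the number of terms returned instead of the unbounded -1 -->
    <int name="terms.limit">20</int>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>

Clients can still override terms.limit per request when a different value is needed.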
Re: Phrase between quotes with dismax edismax
Thanks Erick for your prompt response. I am not sure, but I think I found why the phrase "chef de projet" is not found by dismax and edismax. The following terms are indexed and can be seen with Luke: chef projet chef de projet. When searching for the phrase "chef de projet", the terms 'chef' and 'projet' are found in the index but 'de' is not found, and thus no results. Please note that using the standard Lucene QueryParser, it works well. This is just what I suspect; does it sound correct? Best wishes, Jean-Claude

On Wed, Nov 16, 2011 at 9:26 PM, Erick Erickson erickerick...@gmail.com wrote:
Ah, ok, I was mis-reading some things. So, let's ignore the category bits for now. Questions:
1) Can you refine down the problem? That is, demonstrate this with a single field and leave out the category stuff. Something like q=title:"chef de projet" getting no results and q=title:"chef projet" getting results? The idea is to cycle through all the fields to see if we can hone in on the problem. I'd get rid of any pf parameters in your edismax definition too. I'm after the simplest case that can demonstrate the issue. For that matter, it'd be even easier if you could make this happen with the default searcher (solr/select?q=title:"chef de projet").
2) If you can do 1), please post the field definitions from your schema.xml file. One possibility is that you are removing stopwords at index time but not query time or vice-versa, but that's a wild guess.
3) Once you have a field, use the admin/analysis page to see the exact transformations that occur at index and query time to see if anything jumps out.
All in all, I suspect you have a field that isn't being parsed as you expect at either index or query time, but as I said above, that's a guess. Best Erick

On Wed, Nov 16, 2011 at 5:02 AM, Jean-Claude Dauphin jc.daup...@gmail.com wrote:
Thanks Erick for yr quick answer. I am using Solr 3.1.
1) I have set the mm parameter to 0 and removed the categories from the search. Thus the query is only for "chef de projet" and nothing else. But the problem remains, i.e. searching for "chef de projet" gives no results while searching for "chef projet" gives the right result.
Here is an excerpt from the test I made:

DISMAX query (q) = (chef de projet)

The Parameters:
queryResponse=[{responseHeader={status=0,QTime=157, params={facet=true, f.createDate.facet.date.start=NOW/DAY-6DAYS, tie=0.1, facet.limit=4, f.location.facet.limit=3, q.alt=*:*, facet.date.other=all, hl=true, version=2, bq=[categoryPayloads:category1071^1, categoryPayloads:category10055078^1, categoryPayloads:category10055405^1], fl=*,score, debugQuery=true, facet.field=[soldProvisions, contractTypeText, nafCodeText, createDate, wage, keywords, labelLocation, jobCode, organizationName, requiredExperienceLevelText], qs=3, qt=edismax, facet.date.end=NOW/DAY, mm=0, facet.mincount=1, facet.date=createDate, qf=title^4.0 formattedDescription^2.0 nafCodeText^2.0 jobCodeText^3.0 organizationName^1.0 keywords^3.0 location^1.0 labelLocation^1.0 categoryPayloads^1.0, hl.fl=title, wt=javabin, rows=20, start=0, q=(chef de projet), facet.date.gap=+1DAY, stopwords=false, ps=3}},

The Solr Response:
response={numFound=0

Debug Info:
debug={
rawquerystring=(chef de projet),
querystring=(chef de projet),
parsedquery=+DisjunctionMaxQuery((title:"chef de projet"~3^4.0 | keywords:"chef de projet"^3.0 | organizationName:"chef de projet" | location:"chef de projet" | formattedDescription:"chef de projet"~3^2.0 | nafCodeText:"chef de projet"^2.0 | jobCodeText:"chef de projet"^3.0 | categoryPayloads:"chef de projet"~3 | labelLocation:"chef de projet")~0.1) DisjunctionMaxQuery((title:"((chef chef) de (projet) projet)"~3^4.0)~0.1) categoryPayloads:category1071 categoryPayloads:category10055078 categoryPayloads:category10055405,
parsedquery_toString=+(title:"chef de projet"~3^4.0 | keywords:"chef de projet"^3.0 | organizationName:"chef de projet" | location:"chef de projet" | formattedDescription:"chef de projet"~3^2.0 | nafCodeText:"chef de projet"^2.0 | jobCodeText:"chef de projet"^3.0 | categoryPayloads:"chef de projet"~3 | labelLocation:"chef de projet")~0.1 (title:"((chef chef) de (projet) projet)"~3^4.0)~0.1 categoryPayloads:category1071 categoryPayloads:category10055078 categoryPayloads:category10055405,
explain={}, QParser=ExtendedDismaxQParser, altquerystring=null,
boost_queries=[categoryPayloads:category1071^1, categoryPayloads:category10055078^1, categoryPayloads:category10055405^1],
parsed_boost_queries=[categoryPayloads:category1071, categoryPayloads:category10055078,
Re: delta-import of rich documents like word and pdf files!
Sorry for disturbing you all... actually I had to use plong instead of type string. My problem is solved. Be ready for a new thread. CHEERS
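For anyone landing on this thread: the key point is that range faceting needs a numeric (Trie-based) field type, not a string. A sketch of the combination, assuming plong maps to a Trie long type in your schema (the field name and range values follow the ones used earlier in this thread):

<field name="fileSize" type="plong" indexed="true" stored="true" required="false"/>

and a matching request:

http://localhost:8080/solr/select?q=*:*&facet=true&facet.range=fileSize&f.fileSize.facet.range.start=0&f.fileSize.facet.range.end=600&f.fileSize.facet.range.gap=100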
Re: ISO8601 Date format
On Thu, Nov 17, 2011 at 6:06 PM, Gerke, Axel axel.ge...@haufe-lexware.com wrote:
Hello, due to a bug in another system, we stored a date in a date field with a value like 999-12-31T23:00:00Z. As you can see in the schema browser, Solr stores it correctly with four digits, but in a response the leading zero is missing. My question is: is a three-digit year a valid ISO-8601 date format for the response, or is this a bug? Other languages (e.g. Python) throw an exception on a three-digit year.

http://www.w3.org/TR/NOTE-datetime and http://en.wikipedia.org/wiki/ISO_8601 seem to indicate that a four-digit year with leading zeroes is required. To quote from the General principles section in the latter reference: Each date and time value has a fixed number of digits that must be padded with leading zeros. Regards, Gora
FunctionQuery score=0
Hi, I am using a function query that, based on the user's query, gives a score to the results I am presenting. Some of the results receive score=0 from my function, and I would like them not to appear in the search results. How can I achieve that? Thanks in advance.
Re: strange behavior of scores and term proximity use
Hmmm, I'm not seeing similar behavior on a trunk from today; when did you get your copy? Erick

On Wed, Nov 16, 2011 at 2:06 PM, Ariel Zerbib ariel.zer...@gmail.com wrote:
Hi, for this term proximity query: ab_main_title_l0:"to be or not to be"~1000

http://localhost:/solr/select?q=ab_main_title_l0%3A%22og54ct8n+to+be+or+not+to+be+5w8ojsx2%22~1000&sort=score+desc&start=0&rows=3&fl=ab_main_title_l0%2Cscore%2Cid&debugQuery=true

The first three results are the following:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">5</int>
  </lst>
  <result name="response" numFound="318" start="0" maxScore="3.0814114">
    <doc>
      <long name="id">2315190010001021</long>
      <arr name="ab_main_title_l0"><str>og54ct8n To be or not to be a Jew. 5w8ojsx2</str></arr>
      <float name="score">3.0814114</float>
    </doc>
    <doc>
      <long name="id">2313006480001021</long>
      <arr name="ab_main_title_l0"><str>og54ct8n To be or not to be 5w8ojsx2</str></arr>
      <float name="score">3.0814114</float>
    </doc>
    <doc>
      <long name="id">2356410250001021</long>
      <arr name="ab_main_title_l0"><str>og54ct8n Rumspringa : to be or not to be Amish / 5w8ojsx2</str></arr>
      <float name="score">3.0814114</float>
    </doc>
  </result>
  <lst name="debug">
    <str name="rawquerystring">ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000</str>
    <str name="querystring">ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000</str>
    <str name="parsedquery">PhraseQuery(ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000)</str>
    <str name="parsedquery_toString">ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000</str>
    <lst name="explain">
      <str name="2315190010001021">
        5.337161 = (MATCH) weight(ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000 in 378403) [DefaultSimilarity], result of:
          5.337161 = fieldWeight in 378403, product of:
            0.57735026 = tf(freq=0.3334), with freq of:
              0.3334 = phraseFreq=0.3334
            29.581549 = idf(), sum of:
              1.0012436 = idf(docFreq=3297332, maxDocs=3301436)
              3.0405464 = idf(docFreq=429046, maxDocs=3301436)
              5.3583193 = idf(docFreq=42257, maxDocs=3301436)
              4.3826413 = idf(docFreq=112108, maxDocs=3301436)
              6.3982043 = idf(docFreq=14937, maxDocs=3301436)
              3.0405464 = idf(docFreq=429046, maxDocs=3301436)
              5.3583193 = idf(docFreq=42257, maxDocs=3301436)
              1.0017256 = idf(docFreq=3295743, maxDocs=3301436)
            0.3125 = fieldNorm(doc=378403)
      </str>
      <str name="2313006480001021">
        9.244234 = (MATCH) weight(ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000 in 482807) [DefaultSimilarity], result of:
          9.244234 = fieldWeight in 482807, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = phraseFreq=1.0
            29.581549 = idf(), sum of:
              1.0012436 = idf(docFreq=3297332, maxDocs=3301436)
              3.0405464 = idf(docFreq=429046, maxDocs=3301436)
              5.3583193 = idf(docFreq=42257, maxDocs=3301436)
              4.3826413 = idf(docFreq=112108, maxDocs=3301436)
              6.3982043 = idf(docFreq=14937, maxDocs=3301436)
              3.0405464 = idf(docFreq=429046, maxDocs=3301436)
              5.3583193 = idf(docFreq=42257, maxDocs=3301436)
              1.0017256 = idf(docFreq=3295743, maxDocs=3301436)
            0.3125 = fieldNorm(doc=482807)
      </str>
      <str name="2356410250001021">
        5.337161 = (MATCH) weight(ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000 in 1317563) [DefaultSimilarity], result of:
          5.337161 = fieldWeight in 1317563, product of:
            0.57735026 = tf(freq=0.3334), with freq of:
              0.3334 = phraseFreq=0.3334
            29.581549 = idf(), sum of:
              1.0012436 = idf(docFreq=3297332, maxDocs=3301436)
              3.0405464 = idf(docFreq=429046, maxDocs=3301436)
              5.3583193 = idf(docFreq=42257, maxDocs=3301436)
              4.3826413 = idf(docFreq=112108, maxDocs=3301436)
              6.3982043 = idf(docFreq=14937, maxDocs=3301436)
              3.0405464 = idf(docFreq=429046, maxDocs=3301436)
              5.3583193 = idf(docFreq=42257, maxDocs=3301436)
              1.0017256 = idf(docFreq=3295743, maxDocs=3301436)
            0.3125 = fieldNorm(doc=1317563)
      </str>
    </lst>
  </lst>
</response>

The used version is a 4.0 October snapshot. I have two questions about the result:
- Why are the debug print and the scores in the result different?
- What is the expected behavior of this kind of term proximity query?
The debug scores seem to be well ordered, but the result scores seem to be wrong.
Thanks, Ariel
Re: Phrase between quotes with dismax edismax
OK, it looks like you're mixing fieldTypes. That is, you have some string types, which are completely unanalyzed, and some analyzed fields. The analyzed fields have stopwords removed at index time, and then it looks like your query chain does NOT remove stopwords, or some such. So it's probably a schema issue. The admin/analysis page will help you understand how the analysis chains work. I'd also recommend that you NOT use eDismax when experimenting with analyzers; having requests distributed across all those fields can be confusing. Certainly DO use eDismax when you're working for real, or use the fielded form of the queries (title:"chef de projet") just to reduce the clutter of the output... But you're on the right track. Best Erick
What is the best approach to do reindexing on the fly?
Hi all, I'm using Solr 3.2 with DataImportHandler, periodically updating the index every 5 min. There's a housekeeping script running weekly which deletes some data in the database. I'd like to incorporate the reindexing strategy with this housekeeping script by:
1. Locking the DataImportHandler - not allowing it to perform any update on the index - by having a flag in the database; every time the scheduled job triggers, it first checks the flag before performing an incremental index.
2. Running a separate Solr instance, pointing to the same index, to perform a clean reindex.
Before settling on this setup, I had some options, but they didn't fit very well:
1. Trigger reindexing directly in the running Solr instance - I wrap Solr with our own authentication mechanism, and reindexing would cause a spike in memory usage; affecting the currently running apps (sitting in the same J2EE container) is the last thing I want.
2. Master/slave setup - I think this is the most proper way to do it as a long-term solution, but we have a time constraint, so it won't work for now.
For the above selected strategy, would searches be affected by the reindexing from the 2nd Solr instance? Do we need to tell Solr to update to the new index once it's available? Any better option that I can give a try? Many thanks, Ero
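One detail worth sketching for the last question: with two instances sharing one index directory, the serving instance won't see the rebuilt index until its searcher is reopened. A minimal example, assuming the default /update handler is enabled on the serving instance (host, port, and timing are placeholders):

# after the second instance finishes its clean rebuild and commits,
# ask the serving instance to reopen its searcher:
curl 'http://localhost:8983/solr/update?commit=true'

A core RELOAD via the CoreAdmin API would achieve the same effect.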
Re: FunctionQuery score=0
John wrote: Some of the results are receiving score=0 in my function and I would like them not to appear in the search results.

You can use frange, and filter by score:

q=ipod&fq={!frange l=0 incl=false}query($q)

--
André Bois-Crettez
Search technology, Kelkoo
http://www.kelkoo.com/
Re: FunctionQuery score=0
Doesn't seem to work. I thought that filter queries run before the search is performed, not after... no? Debug doesn't include the filter query, only the below (changed a bit):

BoostedQuery(boost(+fieldName:,boostedFunction(ord(fieldName),query)))

On Thu, Nov 17, 2011 at 5:04 PM, Andre Bois-Crettez andre.b...@kelkoo.com wrote:
you can use frange, and filter by score: q=ipod&fq={!frange l=0 incl=false}query($q)
RE: memory usage keep increase
Erick, thanks for your reply. Yes, virtual memory does not mean physical memory. But when virtual memory exceeds physical memory, the system becomes slow, since lots of paging requests happen. Yongtao

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, November 15, 2011 8:37 AM
To: solr-user@lucene.apache.org
Subject: Re: memory usage keep increase

I'm pretty sure not. The words virtual memory address space are important here; that's not physical memory... Best Erick

On Mon, Nov 14, 2011 at 11:55 AM, Yongtao Liu y...@commvault.com wrote:
Hi all, I saw one issue: RAM usage keeps increasing when we run queries. After looking at the code, it looks like Lucene uses MMapDirectory to map index files into RAM. According to the comments in http://lucene.apache.org/java/3_1_0/api/core/org/apache/lucene/store/MMapDirectory.html, it will use a lot of memory: NOTE: memory mapping uses up a portion of the virtual memory address space in your process equal to the size of the file being mapped. Before using this class, be sure your have plenty of virtual address space, e.g. by using a 64 bit JRE, or a 32 bit JRE with indexes that are guaranteed to fit within the address space. So, my understanding is Solr requires physical RAM = index file size; is that right? Yongtao
Doubts in Shards concept
Hi, I have implemented the shards concept. After sending the request, this is what is shown in the logs:

Nov 15, 2011 10:38:24 PM org.apache.solr.core.SolrCore execute
INFO: [core2] webapp=/solr path=/select params={fl=uid,score&start=0&q=abc&isShard=true&wt=javabin&fsv=true&rows=1410&version=1} hits=3396 status=0 QTime=2
Nov 15, 2011 10:38:24 PM org.apache.solr.core.SolrCore execute
INFO: [db] webapp=/solr path=/select/ params={indent=on&start=1400&q=abc&version=2.2&rows=10} status=0 QTime=58

In db, I give start=1400 and rows=10, and in core2 the request is passed as start=0 and rows=1410. While browsing I came across the URL https://issues.apache.org/jira/browse/SOLR-659, which gives the reason for this behaviour and a patch file to send the same request to all the shards. I need to know whether there is any other config file that can be changed for this issue. The Solr version details I am using:
Solr Specification Version: 1.4.0.2010.01.13.08.09.44
Solr Implementation Version: 1.5-dev exported - yonik - 2010-01-13 08:09:44
Lucene Specification Version: 2.9.1-dev
Lucene Implementation Version: 2.9.1-dev 888785 - 2009-12-09 18:03:31
Please let me know whether this Solr version contains this patch.

This is how the query URL looks:
http://localhost:8080/solr/db/select?indent=on&version=2.2&q=typeFacet%3AABC&shards.start=1400&shards.rows=10&start=1400&rows=10

We noticed that the request went out the same to both the shards and the underlying server, but no document was fetched even though the count was returned properly. Please provide some suggestions.
Migrating from Hibernate Search to Solr
I'm considering migrating from Hibernate Search to Solr, but in order to make that decision, I'd appreciate insight on the following:
1. How difficult is getting Solr up and running? With Hibernate I had to annotate a few classes and set up a config file, so it was pretty easy.
2. How can/should one secure Solr?
3. From what I've read, Solr can work with NoSQL databases. The question is how well, and how involved is the setup?
Thanks. -Ari
Re: Highlighting with a default copy field with EdgeNGramFilterFactory
I found the solution! I needed to also add an EdgeNGramFilterFactory to the fields that are the source of the copyField. That got the highlighting working again.
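For context, a sketch of the kind of edge n-gram analysis chain involved, assuming a typical autocomplete setup (the type name and gram sizes are illustrative, not from the original post):

<fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index prefixes of each token: "s", "so", "sol", "solr", ... -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The poster's fix amounts to giving the copyField source fields the same edge n-gram analysis, so terms matched against the grams can presumably be highlighted in those fields as well.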
Re: Solr Near Real-Time Search, Soft Commit Problem
I guess my first question is: what evidence do you have that Solr is unable to index fast enough? It's quite possible that your database connection is the thing that's unable to process fast enough. That's certainly a guess, but unless your documents are quite complex, 15 records/second isn't likely to cause Solr problems. You might try running a small Java program that executes your database queries and see. The other question I'd ask is whether you're absolutely sure that your delta-import query is correct. Is it possible that you're re-indexing *everything* every time? There's an interactive debugging console you can use that may help; try: http://localhost:8983/solr/admin/dataimport.jsp Best Erick
Re: Migrating from Hibernate Search to Solr
On Nov 17, 2011, at 10:38, Ari King wrote:
I'm considering migrating from Hibernate Search to Solr, but in order to make that decision, I'd appreciate insight on the following:
1. How difficult is getting Solr up and running? With Hibernate I had to annotate a few classes and set up a config file; so it was pretty easy.

So no Hibernate/Solr glue out there already? It'd be nice if you could use Hibernate as you do, but instead of working with the Lucene API directly it would use SolrJ. If this type of glue doesn't already exist, then that'd be the first step, I think. Otherwise, you could use Solr directly, but you'll likely be unhappy with the disconnect compared to what you're used to. SolrJ supports annotations, but not to the degree that Hibernate does, and even so you'd be left to create an indexer and to wire in updates/deletes as well.

2. How can/should one secure Solr?

Secure it from what? Being secure is relative; it depends on what you're trying to protect against. In general, no security needs to be applied directly to Solr, but certainly keep it behind a firewall and even block all IP access except from your application.

3. From what I've read, Solr can work with NoSQL databases. The question is how well and how involved is the setup?

I imagine the specific NoSQL DBs have their own Solr integration glue. But in general, it's pretty trivial to iterate over a collection of objects and send them over to Solr in one way or another. Erik
Highlighting and regex
Hi, Been wrestling with a question on highlighting (or not) - perhaps someone can help? The question is this: Is it possible, using highlighting or perhaps another more suited component, to return words/tokens from a stored field based on a regular expression's capture groups? What I was kind of thinking would happen with highlighting regex (hl.regex.pattern) - but doesn't seem to (although I am a highlighting novice) - is that capture groups specified in a regex would be highlighted. For example:
1) given a field called desc
2) with a stored value of: "the quick brown fox jumps over the lazy dog"
3) specify a regex of: .*quick\s(\S+)\sfox.+\sthe\s(\S+)\sdog.*
4) get in the response: <em>brown</em> and <em>lazy</em>
either as highlighting or through some other means. (I find that using hl.regex.pattern on the above yields: <em>the quick brown fox jumps over the lazy dog</em>.) I'm guessing that I'm misinterpreting the functionality offered by highlighting, but I couldn't find much on the subject in the way of usage docs. I could write a custom highlighter or SearchComponent plugin that would do this, but is there some mechanism out there that can do this sort of thing already? It wouldn't necessarily have to be based on regex, but regex tends to be the de-facto standard for doing capture group token matching (not sure how Solr syntax would do something similar unless there were multiples, maybe?). Any insights greatly appreciated. Many thanks, Peter
Re: Solr Near Real-Time Search, Soft Commit Problem
Erick, thank you for your response.
1) I tried 2 new records per second (the records have only 5 fields in one table), and a 6-second interval too. It should be quite easy for MySQL. But I will check query responses per second as you suggested.
2) I am sure the delta queries are configured well. Full-import completes in 40 secs for 400,000 docs, and deltas in 1 sec for 15 new records. I checked it; there is no problem there.
A couple of pieces of evidence that drove me to think this is a configuration problem:
1- Index files are changing every second.
2- After a server restart, the last query results are preserved. (In NRT they would disappear, right?)
Please correct me if you see any problem in the steps I applied for NRT. Additional specs: 32-bit OS, 4-core i7-2630QM CPU @ 2.00GHz, 6 GB memory. Bests, Jak
Re: Solr Near Real-Time Search, Soft Commit Problem
On Thu, Nov 17, 2011 at 11:48 AM, Jak Akdemir jakde...@gmail.com wrote:
2) I am sure the delta queries are configured well. Full-import completes in 40 secs for 400,000 docs, and deltas in 1 sec for 15 new records. I checked it; there is no problem there.

That's 10,000 docs/sec. If you configure a soft commit for every 15 documents, that means Solr is trying to do 666 commits/sec. Autocommit by number of docs rarely makes sense anymore - I'd suggest configuring both soft and hard commits based on time only. -Yonik http://www.lucidimagination.com
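To make the time-based suggestion concrete, a sketch of the pattern the original poster asked about (1 s soft commits, 10 min hard commits); the openSearcher flag is available in recent 4.x trunk builds, and the exact values here are illustrative, not from this thread:

<autoCommit>
  <maxTime>600000</maxTime>      <!-- hard commit (flush to disk) every 10 minutes -->
  <openSearcher>false</openSearcher>  <!-- don't reopen searchers on hard commits -->
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>        <!-- soft commit (visibility) every second -->
</autoSoftCommit>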
Implications of setting catenateAll=1
Hi, the default for catenateAll is 0, which is what we've been using on the WordDelimiterFilter. What would be the possible negative implications of setting this to 1? So that wi-fi-800 would produce the tokens wi, fi, wifi, 800, wifi800, for example? Thanks
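For reference, a sketch of the filter configuration in question (the surrounding attribute values are typical defaults, not from the original post); note that catenateAll=1 by itself yields wi, fi, 800, and wifi800 - the intermediate wifi token would come from catenateWords=1:

<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="0" catenateNumbers="0"
        catenateAll="1" splitOnCaseChange="1"/>

The main costs are a larger term dictionary and extra overlapping tokens, which can also make phrase queries behave differently if the same chain runs at query time.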
Re: ISO8601 Date format
: My question is: is a three-digit year a valid ISO-8601 date format for
: the response, or is this a bug? Other languages (e.g. Python) throw
: an exception on a three-digit year.

There are some known bugs with esoteric years, but i think the one that's burning you here has been fixed in the 3x branch and will be included in 3.5... https://issues.apache.org/jira/browse/SOLR-2772 -Hoss
Re: Solr Near Real-Time Search, Soft Commit Problem
Yonik, I updated my solrconfig to be time-based only, as follows:

<autoCommit>
  <maxTime>30</maxTime>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>

And changed my soft commit script to the first case:

while [ 1 ]; do
    echo "Soft commit applied!"
    wget -O /dev/null 'http://localhost:8080/solr-jak/dataimport?command=delta-import&commit=false' 2>/dev/null
    sleep 1
done

After a full-import, I inserted 420 new records in a minute (7 new records per second) and soft-committed every second, as we can see in solrconfig.xml. It seems that after all this, Solr can return only 326 of these new 420 records. The index files should not change every second, is that true? (After inserting the 420 records, if I call delta-import with commit=true, all of these records can be seen in the Solr results.) Thanks, Jak
Re: Solr Near Real-Time Search, Soft Commit Problem
Hmmm. It is suspicious that your index files change every second. If you change your cron task to update every 10 seconds, do the index files change every 10 seconds? Regarding your question about "After a server restart, the last query results are preserved. (In NRT they would disappear, right?)": not necessarily. If your autoCommit interval is exceeded, the soft commits will be committed to disk, so your Solr restart would pick them up. But if somehow you're getting a hard commit to happen every second, you should also be seeing a lot of segment merging going on; are you? I think I'd stop the cron job and execute this manually for a while in order to see exactly where the problem is. I'd go ahead and comment out the autoCommit section as well. That should give you a much more reproducible test scenario. Say you do that, issue your delta-import, and immediately kill your server. If you then see the delta data when it starts up, we should understand why. Because it sure would seem like the commit=false isn't doing what you expect. Erick
Re: Solr Near Real-Time Search, Soft Commit Problem
1- There is an improvement on the issue. I added a 10-second interval to the delta query in data-config.xml, so that it also covers records that were already indexed:

    revision_time &gt; DATE_SUB('${dataimporter.last_index_time}', INTERVAL 10 SECOND);

In this case 1369 new records were inserted at 7 records per second, and the Solr response shows all 1369 new records successfully.

2- If I update the bash script to sleep 10 seconds and autoSoftCommit to 1 sec, index files are updated every 10 seconds. If I update autoSoftCommit to 10 seconds and the bash script to sleep 10 sec, index files are updated every 10 seconds. In the index folder, after each update, I see that the segments/index files are changing.

3- I restarted the server before it fell into the autoCommit interval. The deltas are still in the result list. Here is my solrconfig:

    <autoCommit>
      <maxTime>30</maxTime>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>1000</maxTime>
    </autoSoftCommit>

4- I commented out the autoCommit part. Still the index files are changing.

    <!--
    <autoCommit>
      <maxTime>30</maxTime>
    </autoCommit>
    -->
    <autoSoftCommit>
      <maxTime>1000</maxTime>
    </autoSoftCommit>

I did not modify the request part in any of these cases:

    wget -O /dev/null 'http://localhost:8080/solr-jak/dataimport?command=delta-import&commit=false' 2>/dev/null
    #curl http://localhost:8080/solr-jak/update -H 'Content-Type: text/xml' --data-binary '<commit softCommit="true" waitFlush="false" waitSearcher="false"/>' 2>/dev/null

Erick, as you mentioned, I believe that commit=false is not working properly. If you need any more information, I can provide it.

Thank you all for your quick responses and advice.

Bests,
Jak
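For context, a sketch of where that widened delta clause sits in DIH's data-config.xml; the entity name follows the thread's "movie" entity, while the column names and the other queries are illustrative (note the literal &gt; entity required inside an XML attribute):

    <entity name="movie" pk="id"
            query="SELECT * FROM movie"
            deltaQuery="SELECT id FROM movie
                        WHERE revision_time &gt; DATE_SUB('${dataimporter.last_index_time}', INTERVAL 10 SECOND)"
            deltaImportQuery="SELECT * FROM movie WHERE id='${dataimporter.delta.id}'"/>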
Re: Solr Near Real-Time Search, Soft Commit Problem
On Thu, Nov 17, 2011 at 1:34 PM, Erick Erickson erickerick...@gmail.com wrote:
Hmmm. It is suspicious that your index files change every second.

Why is this suspicious? A soft commit still writes out some files currently... it just doesn't fsync them.

-Yonik
http://www.lucidimagination.com
Boosting is slow
Hi all, I have about 20 million records in my solr index. I'm running into a problem now where doing a boost drastically slows down my search application. A typical query for me looks something like: http://localhost:8983/solr/mycore/search/?q=test {!boost b=product(sum(log(sum(myfield,1)),1),recip(ms(NOW,mydate_field),3.16e-11,1,8))} I've tried several variations on the boost to see if that was the problem but even when doing something simple like: http://localhost:8983/solr/mycore/search/?q=test {!boost b=2} it is still really slow. Is there a different approach I should be taking? Thanks, Brian Lamb
Re: Boosting is slow
Sorry, the query is actually:

http://localhost:8983/solr/mycore/search/?q=test{!boost b=product(sum(log(sum(myfield,1)),1),recip(ms(NOW,mydate_field),3.16e-11,1,8))}&start=&sort=score+desc,mydate_field+desc&wt=xslt&tr=mysite.xsl
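One low-risk way to see where the time is going - assuming the stock debug component is enabled - is debugQuery=true, which appends per-component timing and score explanations to the response; the host, core, and handler path here follow the query above:

    curl 'http://localhost:8983/solr/mycore/search/?q=test&debugQuery=true'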
Re: Solr Near Real-Time Search, Soft Commit Problem
Yonik,

Is it ok to see soft-committed records after a server restart, too? If it is, there is no problem left at all. I appended the changing files and 1 sec of log at the end of this e-mail. One significant line says softCommit=true, so Solr recognizes our softCommit request:

    INFO: start commit(optimize=false,waitSearcher=true,expungeDeletes=false,softCommit=true)

I want to fix just a little typo from my last e-mail: "... autosoftcommit to 10 seconds and bashscript to sleep 10 sec, index files are ..." should be "... autosoftcommit to 10 seconds and bashscript to sleep *1 sec*, index files are ...".

Jak
___

[two consecutive "ls index/" listings, column-garbled in the archive; the second listing differs from the first by the appearance of new segment files: _1f_0.frq, _1f_0.prx, _1f_0.tim, _1f_0.tip, _1f.fnm, _1f.nrm, _1f.per]
___

    Nov 17, 2011 2:55:17 PM org.apache.solr.handler.dataimport.DataImporter doDeltaImport
    INFO: Starting Delta Import
    Nov 17, 2011 2:55:17 PM org.apache.solr.core.SolrCore execute
    INFO: [] webapp=/solr-jak path=/dataimport params={commit=false&command=delta-import} status=0 QTime=0
    Nov 17, 2011 2:55:17 PM org.apache.solr.handler.dataimport.SimplePropertiesWriter readIndexerProperties
    INFO: Read dataimport.properties
    Nov 17, 2011 2:55:17 PM org.apache.solr.handler.dataimport.DocBuilder doDelta
    INFO: Starting delta collection.
    Nov 17, 2011 2:55:17 PM org.apache.solr.handler.dataimport.DocBuilder collectDelta
    INFO: Running ModifiedRowKey() for Entity: movie
    Nov 17, 2011 2:55:17 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Creating a connection for entity movie with URL: jdbc:mysql://localhost/imdb
    Nov 17, 2011 2:55:17 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Time taken for getConnection(): 8
    Nov 17, 2011 2:55:17 PM org.apache.solr.handler.dataimport.DocBuilder collectDelta
    INFO: Completed ModifiedRowKey for Entity: movie rows obtained : 147
    Nov 17, 2011 2:55:17 PM org.apache.solr.update.DirectUpdateHandler2 commit
    INFO: start commit(optimize=false,waitSearcher=true,expungeDeletes=false,softCommit=true)
    Nov 17, 2011 2:55:17 PM org.apache.solr.handler.dataimport.DocBuilder collectDelta
    INFO: Completed DeletedRowKey for Entity: movie rows obtained : 0
    Nov 17, 2011 2:55:17 PM org.apache.solr.handler.dataimport.DocBuilder collectDelta
    INFO: Completed parentDeltaQuery for Entity: movie
    Nov 17, 2011 2:55:17 PM org.apache.solr.search.SolrIndexSearcher <init>
    INFO: Opening Searcher@1520a8e main
    Nov 17, 2011 2:55:17 PM org.apache.solr.update.DirectUpdateHandler2 commit
    INFO: end_commit_flush
    Nov 17, 2011 2:55:17 PM org.apache.solr.search.SolrIndexSearcher warm
    INFO: autowarming Searcher@1520a8e main{DirectoryReader(segments_2:1321559475026:nrt _k(4.0):C388607 _50(4.0):C526/132 _3q(4.0):C444/141 _43(4.0):C450/126 _4r(4.0):C470/125 _4e(4.0):C456/135 _3f(4.0):C428/133 _51(4.0):C132/126
Re: Solr Near Real-Time Search, Soft Commit Problem
On Thu, Nov 17, 2011 at 3:56 PM, Jak Akdemir jakde...@gmail.com wrote: Is it ok to see soft committed records after server restart, too? Yes... we currently have Jetty configured to call some cleanups on exit (such as closing the index writer). -Yonik http://www.lucidimagination.com
Re: Solr Near Real-Time Search, Soft Commit Problem
This is great! I guess there is nothing left to worry about for a while. Erick, Yonik, thank you again for your great responses.

Bests,
Jak
Re: Multiple solr webapps
: According to solr wiki, an instruction to use single war file and
: multiple context files (solr[1-2].xml). ...
: I wonder why following structure is not enough. I think this is
: the simplest way (disk space is a bit more necessary, of course): ...

there's nothing stopping you from actually cloning the entire webapp, but there is also no good reason for it. you still have to use something like JNDI to configure the individual webapp instances to know what solr home dir to use.

: I noticed this caution on the wiki page:
: Don't put anything related to Solr under the webapps directory.
: Can someone tell me why don't put anything related to solr under
: the webapps? Is this the reason why single war file configuration
: is recommended?

Because tomcat does bad things if you have both a webapps/foo/ (or a webapps/foo.war) and a context file named foo.xml. i don't remember what exactly the problem is, but they are intended to be mutually exclusive -- ie: either you use the context file and point to the war outside of the webapps dir, or you use the webapps dir -- not both.

-Hoss
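For reference, the context-file approach looks roughly like the sketch below, following the Solr-on-Tomcat wiki pattern; all paths are illustrative (e.g. this would live at conf/Catalina/localhost/solr1.xml):

    <!-- points at a war outside webapps/ and sets the solr home via JNDI -->
    <Context docBase="/opt/solr/solr.war" crossContext="true">
      <Environment name="solr/home" type="java.lang.String"
                   value="/opt/solr/home1" override="true"/>
    </Context>

A second instance would get its own context file (solr2.xml) pointing at the same war but a different solr/home value.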
Re: two word phrase search using dismax
: After putting the same score for title and content in the qf field, docs
: with both words in content moved to fifth place. The doc in the first,
: third and fourth places still have only one of the words in content and
: title. The doc in the second place has one of the words in title and
: both words in the content, but in different places, not together.

details matter -- if you send further follow-up mails, the full details of your dismax options and the score explanations from debugQuery are necessary to be sure people understand what you are describing (a snapshot of reality is far more valuable than a vague description of reality).

off hand, what you are describing sounds correct -- this is what the dismax parser is really designed to do. even if you have given both title and content equal boosts, your title field is probably shorter than your content field, so words matching once in title are likely to score higher than the same word matching once in content due to length normalization -- and unless you set the tie param to something really high, the score contribution from the highest-scoring field (in this case title) will be the dominant factor in the score (it's disjunction *max* by default ... if you make tie=1 then it's disjunction *sum*).

you haven't mentioned anything about the pf param at all, which i can only assume means you aren't using it -- the pf param is how you configure that scores should be increased if/when all of the words in the query string appear together. I would suggest putting all of the fields in your qf param in your pf param as well.

-Hoss
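Putting that advice together, a request might look like the sketch below; the host, core, and query terms are placeholders, but defType, qf, pf, and tie are the standard dismax parameters being discussed (the + in the qf/pf values decodes to a space):

    curl 'http://localhost:8983/solr/select?defType=dismax&q=two+words&qf=title+content&pf=title+content&tie=0.1&debugQuery=true'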
Re: FunctionQuery score=0
: I am using a function query that based on the query of the user gives a
: score for the results I am presenting.

please be specific -- it's not at all clear what the structure of your query is, and the details matter.

: Some of the results are receiving score=0 in my function and I would like
: them not to appear in the search results.

this sounds expected, given how functions work: by definition they match all documents, even if one of the inputs to the function is a query that only matches some documents. you either need to use that query as a filter to constrain the set of documents returned, or you need to restructure your main query using something like the {!boost} parser (which only matches documents from its nested query). if you give us an actual example of what you are doing, we can give you suggestions on how to change it to achieve what you want.

-Hoss
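To illustrate the two query shapes described above -- the field and function names here are invented, not the poster's actual query:

    # a bare function query matches every document; constrain it with a filter query
    curl 'http://localhost:8983/solr/select?q={!func}product(popularity,2)&fq=text:foo'

    # or nest the real query under the {!boost} parser, which matches only
    # the documents its nested query matches
    curl 'http://localhost:8983/solr/select?q={!boost b=product(popularity,2)}text:foo'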
Re: Solr Master High Availability
Look at the repeater setup on the replication page, and instead of "repeater", think "backup master". But you don't really need to even do this. You can simply provision yourself an extra slave. Now, if your master goes south, you can reconfigure any slave as the new master by just putting the configuration file you used for the master on it and pointing the remaining slaves at the new master. Provision another slave, point it at the new master, and you're right back where you started.

But you have one other worry. What if your master goes south in such a way that the index is unusable? Solr/Lucene have a lot of safeguards built in to prevent this, but... You can consider setting up the replicator mentioned above with a deletion policy that keeps, say, 1 or 2 copies of the old index around, and then only replicating, say, every day. That way, you have a couple of days to notice the problem and a viable index to use again.

Under any circumstances, you need to create a mechanism whereby you can re-index from a known good point. At the very least, if your master goes down you may have uncommitted documents. Even if you have all your documents committed, you still have to worry about the polling interval to your backup master. So you should be ready to re-index from the last known good point. But assuming you have a uniqueKey defined, there's no problem with re-indexing documents already in the index; the old copy will just be replaced.

Best
Erick

On Thu, Nov 17, 2011 at 7:30 AM, KARHU Toni toni.ka...@ext.oami.europa.eu wrote:

Hi, I'm looking into high-availability SOLR master configurations. Does anybody have a good solution to this? The things I'm looking into are:

* Using SOLR replication to keep a second backup master.
* Indexing on a separate machine(s), the problem being that its index will differ from the first master's, requiring a full replication to all slaves if the first master fails.
* Having the whole setup replicated to another machine which is then used as a master machine if the primary master fails?

Any more ideas/experiences?

Toni
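A sketch of what the repeater-as-backup-master idea might look like in the backup node's solrconfig.xml; the master URL and poll interval are illustrative (Erick suggests replicating the fallback copy only daily), and the deletion policy shown is the standard solr.SolrDeletionPolicy, which belongs in the index configuration section:

    <!-- slave of the real master; promoted to master only on failover -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
      </lst>
      <lst name="slave">
        <str name="masterUrl">http://master-host:8080/solr/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>

    <!-- keep a couple of old commit points around as a fallback -->
    <deletionPolicy class="solr.SolrDeletionPolicy">
      <str name="maxCommitsToKeep">2</str>
    </deletionPolicy>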
Re: What is the best approach to do reindexing on the fly?
Hmmm, the master/slave setup takes about a day to get completely running, assuming that you don't have any experience to start with, so you may be able to fit that into your schedule. Otherwise, you won't be able to avoid the memory and CPU spikes.

But there's another option. It's actually quite easy to write a SolrJ program in which you can do anything you want, including examining your tables for locking.

And yet another option: create a trigger on your tables that inserts whatever you use to create Solr's uniqueKey into a modified-rows table. Have your SolrJ program simply query that table and delete/update as required to keep the single index in sync with the database.

Of course, all of that depends on how long it takes to re-index from scratch. If it's reasonably quick, perhaps simply re-indexing at 3:00 AM (or whatever) would work.

Best
Erick

On Thu, Nov 17, 2011 at 9:34 AM, erolagnab trung@gmail.com wrote:

Hi all, I'm using Solr 3.2 with DataImportHandler to periodically update the index every 5 min. There's a housekeeping script running weekly which deletes some data in the database. I'd like to incorporate the reindexing strategy with this housekeeping script by:

1. Locking the DataImportHandler - not allowing it to perform any update on the index - by having a flag in the database; every time the scheduled job triggers, it first checks the flag before performing an incremental index.
2. Running a separate Solr instance, pointing to the same index, to perform a clean index.

Now, before coming to this setup, I had some options, but they didn't fit very well:

1. Trigger reindexing directly in the running Solr instance - I wrap Solr with our own authentication mechanism, and reindexing causing a spike in memory usage that affects the currently running apps (sitting in the same j2ee container) is the last thing I want.
2. Master/slave setup - I think this is the most proper way to do it, but looking at it as a long-term solution, we have a time constraint, so it won't work for now.

For the above selected strategy, would searches be affected by the reindexing from the 2nd Solr instance? Do we need to tell Solr to switch to the new index once it's available? Any better option that I can give a try?

Many thanks,
Ero
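A sketch of Erick's trigger idea in MySQL (the database used in this thread); the table and column names are invented for illustration:

    -- tracking table: one row per changed document
    CREATE TABLE solr_changes (
      doc_id     BIGINT    NOT NULL,  -- whatever feeds Solr's uniqueKey
      op         CHAR(1)   NOT NULL,  -- 'D' = deleted, 'U' = updated
      changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );

    -- record deletions performed by the housekeeping script
    CREATE TRIGGER track_deletes AFTER DELETE ON source_table
    FOR EACH ROW INSERT INTO solr_changes (doc_id, op) VALUES (OLD.id, 'D');

A SolrJ (or any client) job can then read solr_changes, issue the corresponding deleteById/add calls, and clear the rows it has processed.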
Re: ExtractingRequestHandler HTTP GET Problem
: indexed file. The CommonsHttpSolrServer sends the parameters as a HTTP
: GET request. Because of that I'll get a socket write error. If I
: change the CommonsHttpSolrServer to send the parameters as HTTP POST,
: sending will work, but the ExtractingRequestHandler will not recognize
: the parameters. If I'm using the EmbeddedSolrServer there is no

that doesn't sound right ... if all you do is configure CommonsHttpSolrServer to use POST instead of GET, it shouldn't change anything about how ExtractingRequestHandler is executed. can you provide the code you have that uses CommonsHttpSolrServer, and info on how you have configured ExtractingRequestHandler, so we can better understand what exactly you are doing?

in general it seems weird to me that you are base64-encoding some text and then sending it to the ExtractingRequestHandler -- why exactly aren't you just sending the text as-is? (is there some special feature of Tika i'm not aware of that only works if you feed it base64-encoded data?)

-Hoss
Re: Migrating from Hibernate Search to Solr
So no Hibernate/Solr glue out there already? It'd be nice if you could use Hibernate as you do now, but have it use SolrJ instead of working with the Lucene API directly. If this type of glue doesn't already exist, then that'd be the first step, I think. Otherwise, you could use Solr directly, but you'll likely be unhappy with the disconnect compared to what you're used to. SolrJ supports annotations, but not to the degree that Hibernate does, and even so you'd be left to create an indexer and to wire in updates/deletes as well. How involved/difficult would you say using Solr directly is? I have no experience with Solr, but from what you described it doesn't sound too bad. -Ari
[ANNOUNCEMENT] Second Edition of the First Book on Solr
Fellow Solr users,

I am proud to announce that the book "Apache Solr 3 Enterprise Search Server" is officially published! This is the second edition of the first book on Solr by me, David Smiley, and my co-author Eric Pugh. You can find full details about the book, download a free chapter, and purchase it here: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book

It is also available through other channels like Amazon. You can feel good about the purchase knowing that 5% of each sale goes to support the Apache Software Foundation. If you buy directly from the publisher, then the basis of the percentage that goes to the ASF (and to me) is higher than if you buy it through other channels.

This book naturally covers the latest features in Solr as of version 3.4, like Result Grouping and Geospatial, but this is not a small update to the first book. We have more experience with Solr and we've listened to reader feedback from the first edition. No chapter was untouched: Faceting gets its own chapter, all search relevancy matters are discussed in one chapter, auto-complete approaches are all discussed together, much of the chapter on integration was rewritten to discuss newer technologies, and the first chapter was greatly streamlined. Furthermore, each chapter has a tip in the introduction that advises readers in a hurry on what parts should be read now or later. Finally, we developed a 2-page parameter quick-reference appendix that you will surely find useful printed on your desk. In summary, we improved the existing content and added about 25% more by page count.

Software, errata, and other information about this book and the previous edition is on our website: http://www.solrenterprisesearchserver.com/

We've been working hard on this book for the last 10 months and we hope it really helps save you time and improves your search project!

Apache Solr 3 Enterprise Search Server

In Detail: If you are a developer building an app today, then you know how important a good search experience is. Apache Solr, built on Apache Lucene, is a wildly popular open source enterprise search server that easily delivers powerful search and faceted navigation features that are elusive with databases. Solr supports complex search criteria, faceting, result highlighting, query-completion, query spell-check, relevancy tuning, and more.

Apache Solr 3 Enterprise Search Server is a comprehensive reference guide for every feature Solr has to offer. It serves the reader right from initiation to development to deployment. It also comes with complete running examples to demonstrate its use and show how to integrate Solr with other languages and frameworks.

Through using a large set of metadata about artists, releases, and tracks, courtesy of the MusicBrainz.org project, you will have a testing ground for Solr, and will learn how to import this data in various ways. You will then learn how to search this data in different ways, including Solr's rich query syntax and boosting match scores based on record data. Finally, we'll cover various deployment considerations, including indexing strategies and performance-oriented configuration that will enable you to scale Solr to meet the needs of a high-volume site.

Sincerely,
David Smiley (primary author) david.w.smi...@gmail.com
Eric Pugh (co-author) ep...@opensourceconnections.com