Re: Merge tool based on mergefactor
On 06/19/2013 03:21 AM, Otis Gospodnetic wrote: You could call the optimize command directly on slaves, but specify the target number of segments, e.g. /solr/update?optimize=true&maxSegments=10 Not sure I recommend doing this on slaves, but you could - maybe you have spare capacity. You may also want to consider not doing it on all your slaves at the same time... IIUC this assumes your slaves do not replicate too often, otherwise replication would reset the index to whatever number of segments the master has. You could still perform an optimize with maxSegments after every replication, if that's acceptable in your situation. However, if you need slaves to update every 2-5 minutes, that would be impractical and wasteful. Is this correct? If so, how do you find a fair compromise/balance between master and slave merge factors if you need very frequent indexing of new documents (say continuous) on the master and up-to-date indexes on the slaves (say a 2-5 minute pollInterval)? -- Cosimo
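For reference, the slave-side optimize call described above might look like this (host, port and core name are placeholders; note the parameters are separated by '&'):

    curl "http://slave_host:8983/solr/core1/update?optimize=true&maxSegments=10"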
Adding documents in Solr plugin
I have a core with millions of records. I want to add a custom handler which scans the existing documents and updates one of the fields (delete and add document) based on a condition (an age threshold, for example). All fields are stored, so there is no problem recreating the document from the search result. I prefer doing it on the Solr server side to avoid sending millions of documents to the client and back. I'm thinking of writing a Solr plugin which will receive a query and update some fields on the query's documents (like the delete-by-query handler). Are there existing solutions or better alternatives? I couldn't find any examples of Solr plugins which update / add / delete documents (I don't need to extend the update handler). If someone has an example it would be a great help. Thanks in advance
Re: Adding documents in Solr plugin
This could be a very useful feature. To do it properly, you'd want some new update syntax, extending that of the atomic updates. That is, a new custom request handler could do it, but might not be the best way. If I were to try this, I'd look into the atomic update tickets in JIRA and see what code they touched. See if you can find a way to add something there. Upayavira On Wed, Jun 19, 2013, at 08:52 AM, Avner Levy wrote: I have a core with millions of records. I want to add a custom handler which scans the existing documents and updates one of the fields (delete and add document) based on a condition (an age threshold, for example). All fields are stored, so there is no problem recreating the document from the search result. I prefer doing it on the Solr server side to avoid sending millions of documents to the client and back. I'm thinking of writing a Solr plugin which will receive a query and update some fields on the query's documents (like the delete-by-query handler). Are there existing solutions or better alternatives? I couldn't find any examples of Solr plugins which update / add / delete documents (I don't need to extend the update handler). If someone has an example it would be a great help. Thanks in advance
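For reference, the existing atomic-update syntax Upayavira refers to looks roughly like this (host, core path, field names and values are only illustrative; atomic updates need stored fields plus the _version_ field and update log). A query-driven variant would effectively have to generate such a "set" operation server-side for every document matching the query:

    curl "http://localhost:8983/solr/update?commit=true" \
      -H 'Content-Type: application/json' \
      -d '[{"id":"doc1","age_group":{"set":"adult"}}]'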
Disable Replication for all Cores in a single Command
Hello Folks, is it possible to disable the replication for ALL cores using one command? We currently use Solr 3.6. Currently we have a curl operation, which fires: http://slave_host:port/solr/core/admin/replication/index.jsp?poll=disable In the documentation there is a URL-Command which seems to be correct, but it says 404. http://slave_host:port/solr/replication?command=disablepoll Since we have many cores and many Servers, this takes a while. Thanks, Ralf
UnInverted multi-valued field
Hi @all. We have the problem that after an update the index takes too much time to 'warm up'. We have some multivalued facet fields, and during startup Solr logs messages like: INFO: UnInverted multi-valued field {field=mt_facet,memSize=18753256,tindexSize=54,time=170,phase1=156,nTerms=17,bigTerms=3,termInstances=903276,uses=0} In the solrconfig we use the facet.method 'fc'. We know that start-up with the method 'enum' is faster, but then the searches are very slow. How do you handle this problem? Or have you any idea for optimizing the warm up? Or what do you do after an update? Greetings Jochen -- Dr. rer. nat. Jochen Lienhard Dezernat EDV Albert-Ludwigs-Universität Freiburg Universitätsbibliothek Rempartstr. 10-16 | Postfach 1629 79098 Freiburg | 79016 Freiburg Telefon: +49 761 203-3908 E-Mail: lienh...@ub.uni-freiburg.de Internet: www.ub.uni-freiburg.de
Re: Solr string field stripping new lines line breaks
Dears, My English is not good, but I will try to explain. I have indexed databases and files. The files include docx, pdf and txt. After indexing, the text extracted from my PDF files runs together as one continuous block. I want the line breaks that are in the document files to also appear as line breaks in the indexed document. My frontend app is Solarium. How can I make line breaks appear in the indexed data? Please assist me on this. Thank you -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-string-field-stripping-new-lines-line-breaks-tp3984384p4071595.html Sent from the Solr - User mailing list archive at Nabble.com.
getting different search results for words with same meaning in Japanese language
Hi, we have two Japanese words with the same meaning, ソフトウェア and ソフトウエア (notice the difference in the capital-I-looking character - the word means 'software' in English). When ソフトウェア is searched, it gives around 8 search results, but when ソフトウエア is searched, it gives only 2 search results. The Japanese translator said this is something called yugari (meaning the above words can be seen like authorise and authorize: they should yield the same search results because they have the same meaning but are spelled differently). We have one solution to this issue - to use synonyms.txt and place all these similar words in that text file. This solved our problem to some extent, but in a real-world scenario we do not have all the Japanese technical words like software, product, technology, and so on, and we cannot keep updating synonyms.txt on a daily basis. Is there a better solution, so that all the similar Japanese words give the same search results? Any help is greatly appreciated. -- Regards, Yash Sharma Sr. Software Engineer | y...@osscube.com | +91-9873200649 OSSCube Solutions Pvt. Ltd. Noida A-42/6, Sector-62 Noida-201301 (UP)
Solr Suggest does not work in solrcloud environment
Hi Guys I am having difficulties running a suggest Search Handler in a solrcloud environment. The configuration was tested on a standalone machine and works fine there. Here is my configuration:

*Schema.xml*

  <field name="suggest" type="suggest_text" indexed="true" stored="false" multiValued="true" />
  <copyField source="field1" dest="suggest" />
  <copyField source="field2" dest="suggest" />
  <copyField source="field3" dest="suggest" />
  ...
  <fieldType name="suggest_text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory" />
      <filter class="solr.SynonymFilterFactory" synonyms="synonym.txt" ignoreCase="true" expand="true" />
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopword.txt" enablePositionIncrements="true" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.KeywordMarkerFilterFactory" protected="protword.txt" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory" />
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopword.txt" enablePositionIncrements="true" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.KeywordMarkerFilterFactory" protected="protword.txt" />
    </analyzer>
  </fieldType>

*Solrconfig.xml*

  <searchComponent class="solr.SpellCheckComponent" name="suggest">
    <str name="queryAnalyzerFieldType">suggest_text</str>
    <lst name="spellchecker">
      <str name="name">suggest</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
      <str name="field">suggest</str>
      <float name="threshold">0</float>
      <str name="buildOnCommit">true</str>
    </lst>
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">suggest</str>
      <str name="classname">solr.DirectSolrSpellChecker</str>
      <str name="distanceMeasure">internal</str>
      <float name="accuracy">0.2</float>
      <int name="maxEdits">2</int>
      <int name="minPrefix">1</int>
      <int name="maxInspections">50</int>
      <int name="minQueryLength">2</int>
      <float name="maxQueryFrequency">0.01</float>
    </lst>
    <lst name="spellchecker">
      <str name="name">wordbreak</str>
      <str name="classname">solr.WordBreakSolrSpellChecker</str>
      <str name="field">suggest</str>
      <str name="combineWords">true</str>
      <str name="breakWords">true</str>
      <int name="maxChanges">10</int>
    </lst>
  </searchComponent>

  <requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
    <lst name="defaults">
      <str name="spellcheck">true</str>
      <str name="spellcheck.dictionary">default</str>
      <str name="spellcheck.dictionary">wordbreak</str>
      <str name="spellcheck.dictionary">suggest</str>
      <str name="spellcheck.onlyMorePopular">true</str>
      <str name="spellcheck.count">10</str>
      <str name="spellcheck.collate">true</str>
    </lst>
    <arr name="components">
      <str>suggest</str>
    </arr>
  </requestHandler>

As soon as I post a query on http://url.com:8983/solr/mycore/suggest?q=bar&wt=json I get an empty answer: {"responseHeader":{"status":0,"QTime":0}} No errors or warnings in the log. Any ideas? Simon -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Suggest-does-not-work-in-solrcloud-environment-tp4071587.html Sent from the Solr - User mailing list archive at Nabble.com.
how to reterieve all results from lucene searcher.search() method
hello, Is there any way to get all the search results? In Lucene we get the top documents by giving a limit, like the top 100, 1000... etc., but what if I want to get all results? How can I achieve that??

    Query qu = new QueryParser(Version.LUCENE_36, field, analyzer).parse(query);
    TopDocs hits = searcher.search(qu, 1000);
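One common workaround (a sketch only; collecting every hit can be very memory-hungry on a large index) is to run a counting pass first and then ask for exactly that many documents:

    import org.apache.lucene.search.TotalHitCountCollector;

    Query qu = new QueryParser(Version.LUCENE_36, field, analyzer).parse(query);
    // First pass: count the matches without collecting any documents.
    TotalHitCountCollector counter = new TotalHitCountCollector();
    searcher.search(qu, counter);
    int total = counter.getTotalHits();
    // Second pass: request exactly that many results (the limit must be >= 1).
    TopDocs hits = searcher.search(qu, Math.max(1, total));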
Re: SOLR Cloud - Disable Transaction Logs
Right, NRT is not tied to cloud, but it is tied to the update log. And you bring up an interesting issue when you talk about avilibility zones. SolrCloud is fairly chatty in that all of the nodes need to talk to all the other nodes in the network and they will. If the nodes are separated by an expensive connection (however you measure expensive, latency or cost to use or) then this may well be a bottleneck. For instance, the leader needs to talk to every one of its followers for an update. Imagine a leader in zone1 and all 15 replicas in zone2. Now the expensive pipe will be used 15 times to send the update. Same for queries, there's an internal software load balancer that sends queries to one node in each shard with no control over what zone it's in. The same argument applies to separate physical data centers FWIW. We're largely speculating that this may lead to bottlenecks, but it's something to keep in mind. There are thoughts about making SolrCloud rack aware in a way that will ameliorate this, but nobody has had time to work on this yet. We'd _love_ to hear about any real-life experience in this area! Best Erick On Tue, Jun 18, 2013 at 4:37 PM, Rishi Easwaran rishi.easwa...@aol.com wrote: Erick, We at AOL mail have been using SOLR for quiet a while and our system is pretty write heavy and disk I/O is one of our bottlenecks. At present we use regular SOLR in the lotsOfCore configuration and I am in the process of benchmarking SOLR cloud for our use case. I don't have concrete data that tLogs are placing lot of load on the system, but for a large scale system like ours even minimal load gets magnified. From the Cloud design, for a properly set up cluster, usually you have replicas at different availability zones . Probablity of losing more than 1 availability zone at any given time should be pretty low. Why have tLogs if all replicas on an update get the request anyway, In theory 1 replica must be able to commit eventually. NRT is an optional feature and probably not tied to Cloud, correct? Thanks, Rishi. -Original Message- From: Erick Erickson erickerick...@gmail.com To: solr-user solr-user@lucene.apache.org Sent: Tue, Jun 18, 2013 4:07 pm Subject: Re: SOLR Cloud - Disable Transaction Logs bq: the replica can take over and maintain a durable state of my index This is not true. On an update, all the nodes in a slice have already written the data to the tlog, not just the leader. So if a leader goes down, the replicas have enough local info to insure that data is not lost. Without tlogs this would not be true since documents are not durably saved until a hard commit. tlogs save data between hard commits. As Yonik explained to me once, soft commits are about visibility, hard commits are about durability and tlogs fill up the gap between hard commits. So to reinforce Shalin's comment yes, you can disable tlogs if 1 you don't want any of SolrCloud's HA/DR capabilities 2 NRT is unimportant IOW if you're using 4.x just like you would 3.x in terms of replication, HA/DR, etc. This is perfectly reasonable, but don't get hung up on disabling tlogs. And you haven't told us _why_ you want to do this. They don't consume much memory or disk space unless you have configured your hard commits (with openSearcher true or false) to be quite long. Do you have any proof at all that the tlogs are placing enough load on the system to go down this road? Best Erick On Tue, Jun 18, 2013 at 10:49 AM, Rishi Easwaran rishi.easwa...@aol.com wrote: SolrJ already has access to zookeeper cluster state. 
Network I/O bottleneck can be avoided by parallel requests. You are only as slow as your slowest responding server, which could be your single leader with the current set up. Wouldn't this lessen the burden of the leader, as he does not have to maintain transaction logs or distribute to replicas? -Original Message- From: Shalin Shekhar Mangar shalinman...@gmail.com To: solr-user solr-user@lucene.apache.org Sent: Tue, Jun 18, 2013 2:05 am Subject: Re: SOLR Cloud - Disable Transaction Logs Yes, but at what cost? You are thinking of replacing disk IO with even more slower network IO. The transaction log is a append-only log -- it is not pretty cheap especially so if you compare it with the indexing process. Plus your write request/sec will drop a lot once you start doing synchronous replication. On Tue, Jun 18, 2013 at 2:18 AM, Rishi Easwaran rishi.easwa...@aol.comwrote: Shalin, Just some thoughts. Near Real time replication- don't we use solrCmdDistributor, which send requests immediately to replicas with a clonedRequest, as an option can't we achieve something similar form CloudSolrserver in Solrj instead of leader doing it. As long as 2 nodes receive writes and acknowledge. durability should be high. Peer-Sync and Recovery - Can we achieve that merging indexes from leader as needed,
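For context, the hard-commit settings Erick refers to earlier in this thread live in solrconfig.xml. A sketch (the values are illustrative, not recommendations): frequent hard commits with openSearcher=false keep the transaction logs small without affecting search visibility.

    <updateHandler class="solr.DirectUpdateHandler2">
      <updateLog/>  <!-- the transaction log being discussed -->
      <autoCommit>
        <maxTime>15000</maxTime>            <!-- hard commit at most every 15s -->
        <openSearcher>false</openSearcher>  <!-- durability only, no new searcher -->
      </autoCommit>
    </updateHandler>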
Re: Solr cloud: zkHost in solr.xml gets wiped out
Thanks for the confirmation! I was wondering where these bits came from wt=javabin version=2 since I wasn't seeing them, but you mentioned SolrCloud, so that explains things. It'll be tonight before I commit the fix I'm afraid, I'm traveling and need to put one more test in. Best Erick On Tue, Jun 18, 2013 at 5:47 PM, Al Wold alw...@alwold.com wrote: I just finished a test with the patch, and it looks like all is working well. On Jun 18, 2013, at 12:19 PM, Al Wold wrote: For the CREATE call, I'm doing it manually per the instructions here: http://wiki.apache.org/solr/SolrCloud Here's the exact URL I'm using: http://asu-solr-cloud.elasticbeanstalk.com/admin/collections?action=CREATEname=directorynumShards=2replicationFactor=2maxShardsPerNode=2 I'm testing out your patch now, and I'll let you know how it goes. Thanks for all the help! -Al On Jun 18, 2013, at 6:47 AM, Erick Erickson wrote: OK, I think I see what's happening. If you do NOT specify an instanceDir on the create (and I'm doing this via the core admin interface, not SolrJ) then the default is used, but not persisted. If you _do_ specify the instance dir, it will be persisted. I've put up another quick patch (tested only in my test case, running full suite now). Can you give it a whirl? You'll have to apply the patch over top of the current 4x, een though the patch is for trunk it applied to 4x cleanly for me and the tests ran. Thanks, Erick On Tue, Jun 18, 2013 at 9:02 AM, Erick Erickson erickerick...@gmail.com wrote: OK, I put up a very preliminary patch attached to the bug if you want to try it out that addresses the extra junk being put in the core tag. Doesn't address the instanceDir issue since I haven't reproduced it yet. Erick On Tue, Jun 18, 2013 at 8:46 AM, Erick Erickson erickerick...@gmail.com wrote: Whoa! What's this junk? qt=/admin/cores wt=javabin version=2 That shouldn't be being preserved, and the instancedir should be! So I'm guessing you're using SolrJ to create the core, but I just reproduced the problem (at least the 'wt=json ') bit from the browser and even from one of my internal tests when I added extra parameters. That said, instanceDir is being preserved in my test, so I'm not seeing everything you're seeing, could you cut/paste your create code? I'll see if I can set up a test case for SolrJ to catch this too. See SOLR-4935 Thanks for reporting! On Mon, Jun 17, 2013 at 5:39 PM, Al Wold alw...@alwold.com wrote: Hi Erick, I tried out your changes from the branch_4x branch. It looks good in terms of preserving the zkHost, but I'm running into an exception because it isn't persisting the instanceDir attribute on the core element. I've got a few other things I need to take care of, but as soon as I have time I'll dig in and see if I can figure out what's going on, and see what changed to make this not work. Here are details on what the files looked like before/after CREATE call: original solr.xml: ?xml version=1.0 encoding=UTF-8 ? solr persistent=true sharedLib=lib zkHost=10.116.249.136:2181 !-- this 8080 might need to change in production -- cores adminPath=/admin/cores zkClientTimeout=2 hostPort=8080 hostContext=// /solr here's what was produced with 4.3 branch + a quick mod to preserve zkHost: ?xml version=1.0 encoding=UTF-8 ? 
solr persistent=true zkHost=10.116.249.136:2181 sharedLib=lib cores adminPath=/admin/cores zkClientTimeout=2 hostPort=8080 hostContext=/ core loadOnStartup=true shard=shard1 instanceDir=directory_shard1_replica1/ transient=false name=directory_shard1_replica1 collection=directory/ core loadOnStartup=true shard=shard2 instanceDir=directory_shard2_replica1/ transient=false name=directory_shard2_replica1 collection=directory/ /cores /solr here's what was produced with branch_4x 4.4-SNAPSHOT: ?xml version=1.0 encoding=UTF-8 ? solr persistent=true zkHost=10.116.249.136:2181 sharedLib=lib cores adminPath=/admin/cores zkClientTimeout=2 distribUpdateSoTimeout=0 distribUpdateConnTimeout=0 hostPort=8080 hostContext=/ core shard=shard1 numShards=2 name=directory_shard1_replica2 collection=directory qt=/admin/cores wt=javabin version=2/ core shard=shard2 numShards=2 name=directory_shard2_replica2 collection=directory qt=/admin/cores wt=javabin version=2/ /cores /solr and here's the error from solr.log after restarting after the CREATE: 2013-06-17 21:37:07,083 1874 [pool-2-thread-1] ERROR org.apache.solr.core.CoreContainer - null:java.lang.NullPointerException: Missing required 'instanceDir' at org.apache.solr.core.CoreDescriptor.doInit(CoreDescriptor.java:133) at org.apache.solr.core.CoreDescriptor.init(CoreDescriptor.java:87) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:365) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:221) at
Re: How to define my data in schema.xml
Well, Avoiding flattening the db to a flat table sounds like a great plan. I found this solution http://wiki.apache.org/solr/DataImportHandler#Full_Import_Example import.a join. not handling a flat table. On Tue, Jun 18, 2013 at 5:53 PM, Jack Krupansky j...@basetechnology.comwrote: You can in fact have multiple collections in Solr and do a limited amount of joining, and Solr has multivalued fields as well, but none of those techniques should be used to avoid the process of flattening and denormalizing a relational data model. It is hard work, but yes, it is required to use Solr effectively. Again, start with the queries - what problem are you trying to solve. Nobody stores data just for the sake of storing it - how will the data be used? -- Jack Krupansky -Original Message- From: Mysurf Mail Sent: Tuesday, June 18, 2013 9:58 AM To: solr-user@lucene.apache.org Subject: Re: How to define my data in schema.xml Hi Jack, Thanks, for you kind comment. I am truly in the beginning of data modeling my schema over an existing working DB. I have used the school-teachers-student db as an example scenario. (a, I have written it as a disclaimer in my first post. b. I really do not know anyone that has 300 hobbies too.) In real life my db is obviously much different, I just used this as an example of potential pitfalls that will occur if I use my old db data modeling notions. obviously, the old relational modeling idioms do not apply here. Now, my question was referring to the fact that I would really like to avoid a flat table/join/view because of the reason listed above. So, my scenario is answering a plain user generated text search over a MSSQLDB that contains a few 1:n relation (and a few 1:n:n relationship). So, I come here for tips. Should I use one combined index (treat it as a nosql source) or separate indices or another. any other ways to define relation data ? Thanks. On Tue, Jun 18, 2013 at 4:30 PM, Jack Krupansky j...@basetechnology.com* *wrote: It sounds like you still have a lot of work to do on your data model. No matter how you slice it, 8 billion rows/fields/whatever is still way too much for any engine to search on a single server. If you have 8 billion of anything, a heavily sharded SolrCloud cluster is probably warranted. Don't plan ahead to put more than 100 million rows on a single node; plan on a proof of concept implementation to determine that number. When we in Solr land say flattened or denormalized, we mean in an intelligent, smart, thoughtful sense, not a mindless, mechanical flattening. It is an opportunity for you to reconsider your data models, both old and new. Maybe data modeling is beyond your skill set. If so, have a chat with your boss and ask for some assistance, training, whatever. Actually, I am suspicious of your 8 billion number - change each of those 300's to realistic, average numbers. Each teacher teaches 300 courses? Right. Each Student has 300 hobbies? If you say so, but... Don't worry about schema.xml until you get your data model under control. For an initial focus, try envisioning the use cases for user queries. That will guide you in thinking about how the data would need to be organized to satisfy those user queries. -- Jack Krupansky -Original Message- From: Mysurf Mail Sent: Tuesday, June 18, 2013 2:20 AM To: solr-user@lucene.apache.org Subject: Re: How to define my data in schema.xml Thanks for your reply. I have tried the simplest approach and it works absolutely fantastic. Huge table - 0s to result. 
two problems as I described earlier, and that is what I try to solve: 1. I create a flat table just for solar. This requires maintenance and develop. Can I run solr over my regular tables? This is my simplest approach. Working over my relational tables, 2. When you query a flat table by school name, as I described, if the school has 300 student, 300 teachers, 300 with 300 teacherCourses, 300 studentHobbies, you get 8.1 Billion rows (300*300*300*300). As I am sure this will work great on solar - searching for the school name will retrieve 8.1 B rows. 3. Lets say all my searches are user generated free text search that is searching name and comments columns. Thanks. On Tue, Jun 18, 2013 at 7:32 AM, Gora Mohanty g...@mimirtech.com wrote: On 18 June 2013 01:10, Mysurf Mail stammail...@gmail.com wrote: Thanks for your quick reply. Here are some notes: 1. Consider that all tables in my example have two columns: Name Description which I would like to index and search. 2. I have no other reason to create flat table other than for solar. So I would like to see if I can avoid it. 3. If in my example I will have a flat table then obviously it will hold a lot of rows for a single school. By searching the exact school name I will likely receive a lot of rows. (my flat table has its own pk) Yes, all of this is
Re: PostingsSolrHighlighter not working on Multivalue field
Well, _how_ does it fail? Unless it's a typo, it should be multiValued (note the capital 'V'). This probably isn't the problem, but just in case. Anything in the logs? What is the field definition? Did you re-index after changing to multiValued? Best Erick On Tue, Jun 18, 2013 at 11:01 PM, Floyd Wu floyd...@gmail.com wrote: In my test case, it seems this new highlighter is not working. When the field is set multivalue=true, the stored text in this field can not be highlighted. Am I missing something? Or is this a current limitation? I have had no luck finding any documentation mentioning this. Floyd
Re: Solr string field stripping new lines line breaks
First, please start a new thread when you change the topic; doing so makes the threads easier to track. But what is your evidence that line breaks are stripped? The stored data is a verbatim copy of the data that went in to the field; nothing at all is changed. So one of several things is happening: 1) they may be being stripped by whatever turns the PDF into a Solr document (Solarium?); 2) if you're displaying them in a browser, the line breaks may be there but just being ignored by the browser. You could write a very brief SolrJ program or similar and see the raw output by getting the data directly from your index... Best Erick On Wed, Jun 19, 2013 at 5:50 AM, sodoo first...@yahoo.com wrote: Dears, My English is not good, but I will try to explain. I have indexed databases and files. The files include docx, pdf and txt. After indexing, the text extracted from my PDF files runs together as one continuous block. I want the line breaks that are in the document files to also appear as line breaks in the indexed document. My frontend app is Solarium. How can I make line breaks appear in the indexed data? Please assist me on this. Thank you -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-string-field-stripping-new-lines-line-breaks-tp3984384p4071595.html Sent from the Solr - User mailing list archive at Nabble.com.
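A minimal SolrJ check along the lines Erick suggests might look like this (SolrJ 4.x shown; URL, core, document id and field name are placeholders; it simply prints whether the stored value still contains newline characters):

    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("id:some_doc_id");   // a document you know came from a PDF
    q.setFields("content");                          // the stored field in question
    SolrDocument doc = server.query(q).getResults().get(0);
    String stored = (String) doc.getFieldValue("content");
    System.out.println(stored.contains("\n") ? "line breaks are stored" : "no line breaks stored");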
Sharding and Replication
Hi, I had questions on implementation of Sharding and Replication features of Solr/Cloud. 1. I noticed that when sharding is enabled for a collection - individual requests are sent to each node serving as a shard. 2. Replication too follows above strategy of sending individual documents to the nodes serving as a replica. I am working with a system that requires massive number of writes - I have noticed that due to above reason - the cloud eventually starts to fail (Even though I am using a ensemble). I do understand the reason behind individual updates - but why not batch them up or give a option to batch N updates in either of the above case - I did come across a presentation that talked about batching 10 updates for replication at least, but I do not think this is the case. - Asif
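On the client side it is at least possible to batch what goes over the wire to the cluster, even if the internal leader-to-replica forwarding stays per-document. A SolrJ sketch (the batch size and the 'server' and 'docsToIndex' variables are assumptions):

    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    for (SolrInputDocument doc : docsToIndex) {
      batch.add(doc);
      if (batch.size() >= 1000) {   // illustrative batch size
        server.add(batch);          // one request for the whole batch
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      server.add(batch);
    }
    server.commit();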
Re: Solr Suggest does not work in solrcloud environment
Hi, Check the obvious first, that you have rebuilt reloaded the suggest dictionary individually on all nodes. Also the other checks here: http://stackoverflow.com/questions/6653186/solr-suggester-not-returning-any-results Then, try with one of query component OR distrib=false setting: http://lucene.472066.n3.nabble.com/SolrCloud-vs-distributed-suggester-td4041859.html Your suggester entry seems a little bloated. Check if the commented portions are needed: searchComponent class=solr.SpellCheckComponent name=suggest str name=queryAnalyzerFieldTypesuggest_text/str lst name=spellchecker str name=namesuggest/str str name=classnameorg.apache.solr.spelling.suggest.Suggester/str str name=lookupImplorg.apache.solr.spelling.suggest.tst.TSTLookup/str str name=fieldsuggest/str float name=threshold0/float str name=buildOnCommittrue/str /lst !-- !!! Do you need these? !!! lst name=spellchecker str name=namedefault/str str name=fieldsuggest/str str name=classnamesolr.DirectSolrSpellChecker/str str name=distanceMeasureinternal/str float name=accuracy0.2/float int name=maxEdits2/int int name=minPrefix1/int int name=maxInspections50/int int name=minQueryLength2/int float name=maxQueryFrequency0.01/float /lst lst name=spellchecker str name=namewordbreak/str str name=classnamesolr.WordBreakSolrSpellChecker/str str name=fieldsuggest/str str name=combineWordstrue/str str name=breakWordstrue/str int name=maxChanges10/int /lst -- /searchComponent requestHandler class=org.apache.solr.handler.component.SearchHandler name=/suggest lst name=defaults str name=spellchecktrue/str !-- !!! Do you need these? !!! str name=spellcheck.dictionarydefault/str str name=spellcheck.dictionarywordbreak/str -- str name=spellcheck.dictionarysuggest/str str name=spellcheck.onlyMorePopulartrue/str str name=spellcheck.count10/str str name=spellcheck.collatetrue/str /lst arr name=components strsuggest/str !-- !!! Add query component here !!!-- strquery/str /arr /requestHandler Regards, Aloke On Wed, Jun 19, 2013 at 2:33 PM, Sharp s.sh...@infovations.ch wrote: Hi Guys I am having difficulties running a suggest Search Handler in a solrcloud environment. The configuration was tested on a standalone machine and works fine there. Here is my configuration: *Schema.xml* field name=suggest type=suggest_text indexed=true stored=false multiValued=true / copyField source=field1 dest=suggest / copyField source=field2 dest=suggest / copyField source=field3 dest=suggest / ... 
fieldType name=suggest_text class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=true analyzer type=index tokenizer class=solr.KeywordTokenizerFactory / filter class=solr.SynonymFilterFactory synonyms=synonym.txt ignoreCase=true expand=true / filter class=solr.StopFilterFactory ignoreCase=true words=stopword.txt enablePositionIncrements=true / filter class=solr.LowerCaseFilterFactory / filter class=solr.KeywordMarkerFilterFactory protected=protword.txt / /analyzer analyzer type=query tokenizer class=solr.KeywordTokenizerFactory / filter class=solr.StopFilterFactory ignoreCase=true words=stopword.txt enablePositionIncrements=true / filter class=solr.LowerCaseFilterFactory / filter class=solr.KeywordMarkerFilterFactory protected=protword.txt / /analyzer /fieldType *Solrconfig.xml* searchComponent class=solr.SpellCheckComponent name=suggest str name=queryAnalyzerFieldTypesuggest_text/str lst name=spellchecker str name=namesuggest/str str name=classnameorg.apache.solr.spelling.suggest.Suggester/str str name=lookupImplorg.apache.solr.spelling.suggest.tst.TSTLookup/str str name=fieldsuggest/str float name=threshold0/float str name=buildOnCommittrue/str /lst lst name=spellchecker str name=namedefault/str str name=fieldsuggest/str str
Re: UnInverted multi-valued field
Take a look at using DocValues for faceted fields. -- Jack Krupansky -Original Message- From: Jochen Lienhard Sent: Wednesday, June 19, 2013 5:30 AM To: solr-user@lucene.apache.org Subject: UnInverted multi-valued field Hi @all. We have the problem that after an update the index takes to much time for 'warm up'. We have some multivalued facet-fields and during the startup solr creates the messages: INFO: UnInverted multi-valued field {field=mt_facet,memSize=18753256,tindexSize=54,time=170,phase1=156,nTerms=17,bigTerms=3,termInstances=903276,uses=0} In the solconfig we use the facet.method 'fc'. We know, that the start-up with the method 'enum' is faster, but then the searches are very slow. How do you handle this problem? Or have you any idea for optimizing the warm up? Or what do you do after an update? Greetings Jochen -- Dr. rer. nat. Jochen Lienhard Dezernat EDV Albert-Ludwigs-Universität Freiburg Universitätsbibliothek Rempartstr. 10-16 | Postfach 1629 79098 Freiburg | 79016 Freiburg Telefon: +49 761 203-3908 E-Mail: lienh...@ub.uni-freiburg.de Internet: www.ub.uni-freiburg.de
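A docValues-enabled definition for the facet field might look like this in schema.xml (assuming Solr 4.2 or later, a string-type field, and re-indexing afterwards; the type and attributes shown are only an example):

    <field name="mt_facet" type="string" indexed="true" stored="false"
           multiValued="true" docValues="true"/>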
Re: Disable Replication for all Cores in a single Command
On 6/19/2013 2:18 AM, Ralf Heyde wrote: Hello Folks, is it possible to disable the replication for ALL cores using one command? We currently use Solr 3.6. Currently we have a curl operation, which fires: http://slave_host:port/solr/core/admin/replication/index.jsp?poll=disable In the documentation there is a URL-Command which seems to be correct, but it says 404. http://slave_host:port/solr/replication?command=disablepoll I don't think there is a way to do this, because each Solr core is self-contained and its configuration is independent of the others. The URL that you have shown that doesn't include the core name will only work in a multicore environment if the defaultCoreName attribute is found in solr.xml, and will only access the specific core that is named there. I know this attribute works in Solr 4.x, but I don't know if it worked in 3.x. I have never used it. It might actually make sense to add one or more actions to the CoreAdmin for this, but I'm fairly sure that the feature doesn't currently exist. Thanks, Shawn
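Given that, the usual workaround is simply to loop over the cores from a script; something like this (host, port and core names are placeholders):

    for core in core0 core1 core2; do
      curl "http://slave_host:8983/solr/$core/replication?command=disablepoll"
    done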
Re: Solr Cloud Hangs consistently .
Update!! Got SOLR cloud working, was able to do 90k document inserts with replicationFactor=2, with my jmeter script, previously was getting stuck with 3k inserts or less. After some investigation, figured out that ulimits for my process were not being set properly, OS defaults were kicking in, which is very small for a server app. One of our install script had changed. I had to up the ulimits - -n,-u,-v and for now no other issues seen. -Original Message- From: Rishi Easwaran rishi.easwa...@aol.com To: solr-user solr-user@lucene.apache.org Sent: Tue, Jun 18, 2013 10:40 am Subject: Re: Solr Cloud Hangs consistently . Mark, All I am doing are inserts, afaik search side deadlocks should not be an issue. I am using Jmeter, standard test driver we use for most of our benchmarks and stats collection. My jmeter.jmx file- http://apaste.info/79IS , maybe i overlooked something Is there a benchmark script that solr community uses (preferably with jmeter), we are write heavy so at the moment focusing on inserts only. Thanks, Rishi. -Original Message- From: Yago Riveiro yago.rive...@gmail.com To: solr-user solr-user@lucene.apache.org Sent: Mon, Jun 17, 2013 6:19 pm Subject: Re: Solr Cloud Hangs consistently . I do all the indexing through a HTTP POST, with replicationFactor=1 no problem, if is higher deadlock problems can appear A stack trace like this http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067862 is that I get -- Yago Riveiro Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Monday, June 17, 2013 at 11:03 PM, Mark Miller wrote: If it actually happens with replicationFactor=1, it doesn't likely have anything to do with the update handler issue I'm referring to. In some cases like these, people have better luck with Jetty than Tomcat - we test it much more. For instance, it's setup to help avoid search side distributed deadlocks. In any case, there is something special about it - I do and have seen a lot of heavy indexing to SolrCloud by me and others without running into this. Both with replicationFacotor=1 and greater. So there is something specific in how the load is being done or what features/methods are being used that likely causes it or makes it easier to cause. But again, the issue I know about involves threads that are not even created in the replicationFactor = 1 case, so that could be a first report afaik. - Mark On Jun 17, 2013, at 5:52 PM, Rishi Easwaran rishi.easwa...@aol.com (mailto:rishi.easwa...@aol.com) wrote: Update!! This happens with replicationFactor=1 Just for kicks I created a collection with a 24 shards, replicationfactor=1 cluster on my exisiting benchmark env. Same behaviour, SOLR cloud just hangs. Nothing in the logs, top/heap/cpu most metrics looks fine. Only indication seems to be netstat showing incoming request not being read in. Yago, I saw your previous post (http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html#a4067631) Following it, Last week, I upgraded to SOLR 4.3, to see if the issue gets fixed, but no luck. Looks like this is a dominant and easily reproducible issue on SOLR cloud. Thanks, Rishi. -Original Message- From: Yago Riveiro yago.rive...@gmail.com (mailto:yago.rive...@gmail.com) To: solr-user solr-user@lucene.apache.org (mailto:solr-user@lucene.apache.org) Sent: Mon, Jun 17, 2013 5:15 pm Subject: Re: Solr Cloud Hangs consistently . I can confirm that the deadlock happen with only 2 replicas by shard. 
I need shutdown one node that host a replica of the shard to recover the indexation capability. -- Yago Riveiro Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Monday, June 17, 2013 at 6:44 PM, Rishi Easwaran wrote: Hi All, I am trying to benchmark SOLR Cloud and it consistently hangs. Nothing in the logs, no stack trace, no errors, no warnings, just seems stuck. A little bit about my set up. I have 3 benchmark hosts, each with 96GB RAM, 24 CPU's and 1TB SSD. Each host is configured to have 8 SOLR cloud nodes running at 4GB each. JVM configs: http://apaste.info/57Ai My cluster has 12 shards with replication factor 2- http://apaste.info/09sA I originally stated with SOLR 4.2., tomcat 5 and jdk 6, as we are already running this configuration in production in Non-Cloud form. It got stuck repeatedly. I decided to upgrade to the latest and greatest of everything, SOLR 4.3, JDK7 and tomcat7. It still shows same behaviour and hangs through the test. My test schema and config. Schema.xml - http://apaste.info/imah SolrConfig.xml - http://apaste.info/ku4F The test is pretty simple. its a jmeter test with update command via SOAP rpc (round robin
Highlighting using hl.q without a df field
Is it possible to use the hl.q parameter if you're using the extended dismax query parser and have defined the "qf" field, but not a "df" field? Here's a sample query: q=drive&fq=cat:electronics&hl=true&hl.fl=cat,name&hl.q=drive cat:electronics. In this case I want to highlight the facet "electronics" and the word "drive" within the cat and name fields. Assuming I'm understanding the wiki correctly, snippets should be generated for the hl.fl fields. What I'm getting is an error message saying "no field name specified in query and no default specified via 'df' param". If I remove the word "drive" from the hl.q parameter, it works correctly, which makes sense given the error. I just don't understand why it's not using the "qf" or "hl.fl" fields to query against. Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Highlighting-using-hl-q-without-a-df-field-tp4071648.html Sent from the Solr - User mailing list archive at Nabble.com.
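One workaround that may be worth trying (an untested guess, not a confirmed fix) is to field-qualify every term in hl.q so the highlighter never needs a default field:

    q=drive&fq=cat:electronics&hl=true&hl.fl=cat,name&hl.q=name:drive cat:electronics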
Re: UnInverted multi-valued field
On Wed, 2013-06-19 at 11:30 +0200, Jochen Lienhard wrote: INFO: UnInverted multi-valued field {field=mt_facet,memSize=18753256,tindexSize=54,time=170,phase1=156,nTerms=17,bigTerms=3,termInstances=903276,uses=0} 170ms does not sound like much to me. What are you hoping for? We know, that the start-up with the method 'enum' is faster, but then the searches are very slow. That is a bit strange. With only 17 terms, enum should be quite fast. How much do the two methods differ in speed? - Toke Eskildsen
Re: yet another optimize question
indeed the actual syntax for per field facet is : f.mysparefieldname.facet.method=enum André On 06/18/2013 09:00 PM, Petersen, Robert wrote: Hi Andre, Wow that is astonishing! I will definitely also try that out! Just set the facet method on a per field basis for the less used sparse facet fields eh? Thanks for the tip. Thanks Robi -Original Message- From: Andre Bois-Crettez [mailto:andre.b...@kelkoo.com] Sent: Tuesday, June 18, 2013 3:03 AM To: solr-user@lucene.apache.org Subject: Re: yet another optimize question Recently we had steadily increasing memory usage and OOM due to facets on dynamic fields. The default facet.method=fc need to build a large array of maxdocs ints for each field (a fieldCache or fieldValueCahe entry), whether it is sparsely populated or not. Once you have reduced your number of maxDocs with the merge policy, it can be interesting to try facet.method=enum for all the sparsely populated dynamic fields. Despite what is said in the wiki, in our case the performance was similar to facet.method=fc, however the JVM heap usage went down from about 20GB to 4GB. André On 06/17/2013 08:21 PM, Petersen, Robert wrote: Also some time ago I made all our caches small enough to keep us from getting OOMs while still having a good hit rate.Our index has about 50 fields which are mostly int IDs and there are some dynamic fields also. These dynamic fields can be used for custom faceting. We have some standard facets we always facet on and other dynamic facets which are only used if the query is filtering on a particular category. There are hundreds of these fields but since they are only for a small subset of the overall index they are very sparsely populated with regard to the overall index. -- André Bois-Crettez Search technology, Kelkoo http://www.kelkoo.com/ Kelkoo SAS Société par Actions Simplifiée Au capital de € 4.168.964,30 Siège social : 8, rue du Sentier 75002 Paris 425 093 069 RCS Paris Ce message et les pièces jointes sont confidentiels et établis à l'attention exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce message, merci de le détruire et d'en avertir l'expéditeur. -- André Bois-Crettez Search technology, Kelkoo http://www.kelkoo.com/ Kelkoo SAS Société par Actions Simplifiée Au capital de € 4.168.964,30 Siège social : 8, rue du Sentier 75002 Paris 425 093 069 RCS Paris Ce message et les pièces jointes sont confidentiels et établis à l'attention exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce message, merci de le détruire et d'en avertir l'expéditeur.
Re: Solr Suggest does not work in solrcloud environment
Hi Aloke Thanks for your reply. It works with the http://url.com:8983/solr/mycore/suggest?q=bar&wt=json&distrib=true parameter or when inserted into the defaults:

  <requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
    <lst name="defaults">
      <str name="spellcheck">true</str>
      <str name="spellcheck.dictionary">default</str>
      <str name="spellcheck.dictionary">suggest</str>
      <str name="spellcheck.onlyMorePopular">false</str>
      <str name="spellcheck.count">10</str>
      <str name="spellcheck.collate">true</str>
      <bool name="distrib">false</bool>
    </lst>
    <arr name="components">
      <str>suggest</str>
    </arr>
  </requestHandler>

I use the bootstrap parameter at startup, so the configuration is deployed to all other servers. The query component just creates additional output but nothing useful:

  <arr name="components">
    <str>suggest</str>
    <str>query</str>
  </arr>

So why is the additional parameter necessary? I would assume that Solr takes care of it internally. I have only configured one shard. But thanks anyway. It works as a workaround so far. Simon -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Suggest-does-not-work-in-solrcloud-environment-tp4071587p4071660.html Sent from the Solr - User mailing list archive at Nabble.com.
How to dynamically add geo fields to a query using a request handler
Hi We have a request handler defined in solrconfig.xml that specifies a list of fields to return for the request using the fl name, e.g. <str name="fl">createdDate</str> When constructing a query using SolrJ that uses this request handler, we want to conditionally add the geospatial fields that will tell us the distance of a record in the Solr index from a given location. Currently we add this to the query by specifying solrQuery.set("fl", "*,distance:geodist()"); This has the effect of returning all fields for the record - not those specified in the request handler. I'm assuming this is because the * in the solrQuery.set method is overriding those statically defined in the request handler. I have tried to add the geodist property via the solrQuery.addField() method, but that complains saying it is not a valid field - maybe I used it incorrectly? Has anybody any ideas how to achieve this? Thanks Ade -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-dynamically-add-geo-fields-to-a-query-using-a-request-handler-tp4071655.html Sent from the Solr - User mailing list archive at Nabble.com.
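One approach that should work (a sketch; it assumes you are willing to repeat the handler's field list on the client, and that sfield/pt are supplied as usual for geodist) is to enumerate the fields explicitly instead of using '*':

    // createdDate stands in for whatever the handler's fl currently lists
    solrQuery.set("fl", "createdDate,distance:geodist()");
    solrQuery.set("sfield", "location");   // hypothetical spatial field
    solrQuery.set("pt", "52.5,13.4");      // hypothetical point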
another transaction log + commit question
Hi, We hard committed (/update/csv?commit=true) about 20,000 documents to SolrCloud (5 shards, 1 replica each = 10 JVM instances). We have commented out both autoCommit and autoSoftCommit settings from solrconfig.xml. What we noticed is that the transaction log size never goes down to 0. We thought that once fsync to all replicas etc. finishes, the trans log should get deleted since everything is persisted. We restarted the cloud a couple of times but the trans log was always bigger than the size of the index for that shard. Why is that?

  1.9M  $HOME/solr_data/solr1
  3.0M  $HOME/solr_data/solr1_tranlog
  2.2M  $HOME/solr_data/solr2
  3.0M  $HOME/solr_data/solr2_tranlog

If we have commented out the autoCommit setting from solrconfig.xml and we hard commit say 20K documents every 10 minutes, when will a new searcher get created? Without the autoCommit setting, what is the default behavior of the new searcher? One last question: does a new searcher get created and all caches get refreshed for every soft commit? Or does Solr update the existing searcher with what changed during the last soft commit? Many Thanks!
Re: UnInverted multi-valued field
On Wed, Jun 19, 2013 at 5:30 AM, Jochen Lienhard lienh...@ub.uni-freiburg.de wrote: Hi @all. We have the problem that after an update the index takes to much time for 'warm up'. We have some multivalued facet-fields and during the startup solr creates the messages: INFO: UnInverted multi-valued field {field=mt_facet,memSize=** 18753256,tindexSize=54,time=**170,phase1=156,nTerms=17,** bigTerms=3,termInstances=**903276,uses=0} In the solconfig we use the facet.method 'fc'. We know, that the start-up with the method 'enum' is faster, but then the searches are very slow. How do you handle this problem? Or have you any idea for optimizing the warm up? Or what do you do after an update? You probably know, but just in case... you may use autowarming; the searcher will populate the cache and only after the warmup queries finished, will it be exposed to the world. The old searcher continues to handle requests in the meantime. roman Greetings Jochen -- Dr. rer. nat. Jochen Lienhard Dezernat EDV Albert-Ludwigs-Universität Freiburg Universitätsbibliothek Rempartstr. 10-16 | Postfach 1629 79098 Freiburg | 79016 Freiburg Telefon: +49 761 203-3908 E-Mail: lienh...@ub.uni-freiburg.de Internet: www.ub.uni-freiburg.de
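A warming listener for this kind of facet field could be configured in solrconfig.xml roughly like this (the field name is taken from the log message above, the other values are illustrative):

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="rows">0</str>
          <str name="facet">true</str>
          <str name="facet.field">mt_facet</str>
        </lst>
      </arr>
    </listener>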
Re: TieredMergePolicy reclaimDeletesWeight
The default is 2.0, and higher values will more strongly favor merging segments with deletes. I think 20.0 is likely way too high ... maybe try 3-5? Mike McCandless http://blog.mikemccandless.com On Tue, Jun 18, 2013 at 6:46 PM, Petersen, Robert robert.peter...@mail.rakuten.com wrote: Hi In continuing a previous conversation, I am attempting to not have to do optimizes on our continuously updated index in Solr 3.6.1, and I came across the mention of the reclaimDeletesWeight setting in this blog: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html We do a *lot* of deletes in our index so I want to make the merges more aggressive about reclaiming deletes, but I am having trouble finding much out about this setting. Does anyone have experience with this setting? Would the below accomplish what I want, i.e. for it to go after deletes more aggressively than normal? I got the impression 10.0 was the default from looking at this code, but I could be wrong: https://builds.apache.org/job/Lucene-Solr-Clover-trunk/lastSuccessfulBuild/clover-report/org/apache/lucene/index/TieredMergePolicy.html?id=3085

  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">20</int>
    <int name="segmentsPerTier">8</int>
    <double name="reclaimDeletesWeight">20.0</double>
  </mergePolicy>

Thanks Robert (Robi) Petersen Senior Software Engineer Search Department
Re: Question about SOLR search relevance score
On 19 June 2013 21:15, sérgio Alves sd_t_al...@hotmail.com wrote: [...] Right now we're having problems with some common search terms. They return varied results on the search results, and the products which should appear first in the results, are scored lower than other, seemingly unrelated, products. [...] I wanted to know if there is a parameter or any possible way for me to know the way that solr calculates the scores it returns. For example, if we had a search relevancy formula like QF=attributes_name^15+attributes_brand^10+attributes_category^8, how can I know that brand scored 'x', for name 'y' and category 'z'. Is that possible? How can I do that? [...] To get an explanation of the scoring, add debugQuery=on as a parameter to your Solr search URL. Please see http://wiki.apache.org/solr/CommonQueryParameters#debugQuery There are also various 'explain' parameters that might be useful. I take it that you have already seen http://wiki.apache.org/solr/SolrRelevancyFAQ Regards, Gora
Question about SOLR search relevance score
Hi. My name is Sérgio Alves and I'm a developer in a project that uses solr as its search engine. Right now we're having problems with some common search terms. They return varied results on the search results, and the products which should appear first in the results, are scored lower than other, seemingly unrelated, products. I wanted to know if there is a parameter or any possible way for me to know the way that solr calculates the scores it returns. For example, if we had a search relevancy formula like QF=attributes_name^15+attributes_brand^10+attributes_category^8, how can I know that brand scored 'x', for name 'y' and category 'z'. Is that possible? How can I do that? This is urgent, if someone could take the time and answer this topic to me in a quick manner, I would really appreciate it. Thank you very much for the attention, best regards, Sérgio Alves
RE: Question about SOLR search relevance score
Hi Sergio, Append 'debugQuery=on' to your queries to learn more about how your queries are being evaluated/ranked. i.e. qf=attributes_name^15+attributes_brand^10+attributes_category^8debugQuery=on You'll get an XML section that is dedicated to debug information. I've found http://explain.solr.pl/ useful in understanding and visualizing the debug output. Swati -Original Message- From: sérgio Alves [mailto:sd_t_al...@hotmail.com] Sent: Wednesday, June 19, 2013 11:45 AM To: solr-user@lucene.apache.org Subject: Question about SOLR search relevance score Hi. My name is Sérgio Alves and I'm a developer in a project that uses solr as its search engine. Right now we're having problems with some common search terms. They return varied results on the search results, and the products which should appear first in the results, are scored lower than other, seemingly unrelated, products. I wanted to know if there is a parameter or any possible way for me to know the way that solr calculates the scores it returns. For example, if we had a search relevancy formula like QF=attributes_name^15+attributes_brand^10+attributes_category^8, how can I know that brand scored 'x', for name 'y' and category 'z'. Is that possible? How can I do that? This is urgent, if someone could take the time and answer this topic to me in a quick manner, I would really appreciate it. Thank you very much for the attention, best regards, Sérgio Alves
Apparent odd interaction between autoCommit values and indexing ram buffer
I've run into something a little odd that's been happening for a while. The apparent symptoms: Two index segments are created every time an autoCommit (hard, not soft) happens during a DIH full-import. Here's the directory listing from the first few minutes of importing, and a related INFOSTREAM: http://apaste.info/22ue https://dl.dropboxusercontent.com/u/97770508/INFOSTREAM-s1build.txt The INFOSTREAM file has cruft from before, so if you search for 3g8 in the file, you'll be at the beginning of the relevant section. I brought this up without resolution on the dev list last December. After some discussion in #solr-dev yesterday and some poking around with branch_4x, I think I might have figured out (at a high level) what's going on. My 'ramBufferSizeMB' value is 48, and my autoCommit maxDocs is 25000. My documents probably tend to be 1-2kb, with some increasing a little beyond that. Looking at the numDocs for each segment, here's what I think is happening: The autoCommit kicks in after the first 25000 docs (25002 to be precise), but the ram buffer isn't emptied. The next 3339 documents get indexed, at which point the ram buffer fills up, so it flushes another segment. Then it does another 21674 docs to approximately reach 25000 for autoCommit, which forces another segment flush, but without emptying the buffer. lather, rinse, repeat. Each pair of numDocs values after the initial 25002 does add up to approximately 25000. If I'm right about what's happening here, then here's the big question: Should the ram buffer be emptied when autoCommit triggers? I think that it should, but can it be done without drastically affecting performance? I haven't looked at the code ... I expect that it'll take me forever to understand it well enough to figure out if I'm right or wrong.
Update by query?
Quick check to see if Solr supports an update-by-query feature or if anyone has thought about something like this ... similar to delete-by-query. My specific use case is a metadata field that needs to be updated for N docs, where N > 1 and the set can easily be identified by a query. Currently, I have to pull them all back and update, which works but is concerning when N is very large. I checked JIRA and didn't see mention of this but might have missed it. Cheers, Tim
SOLR : ArrayIndexOutOfBoundsException from SolrDispatchFilter
Need help to figure out the error below.

*Code Snippet*:

  public class ConnectionComponent extends SearchComponent {

    @Override
    public void process(ResponseBuilder rb) throws IOException {
      NamedList nList = new SimpleOrderedMap();
      NamedList nl = new SimpleOrderedMap();
      List<Document> ld = new ArrayList<Document>();
      Document mydoc = new Document();
      mydoc.add(f); // IndexableField f not null
      ld.add(mydoc);
      nl.add(someKey, ld);
      nList.add(otherKey, nl);
      // rb instance of ResponseBuilder
      rb.rsp.add(returnKey, nList);
    }
  }

  ERROR org.apache.solr.servlet.SolrDispatchFilter ? null:java.lang.ArrayIndexOutOfBoundsException: -1
      at java.util.ArrayList.get(ArrayList.java:324)
      at java.util.Collections$UnmodifiableList.get(Collections.java:1152)
      at org.apache.solr.response.transform.ValueSourceAugmenter.transform(ValueSourceAugmenter.java:92)
      at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:165)
      at org.apache.solr.response.JSONWriter.writeArray(JSONResponseWriter.java:526)
      at org.apache.solr.response.TextResponseWriter.writeArray(TextResponseWriter.java:289)
      at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:192)
      at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)
      at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)
      at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)
Re: Apparent odd interaction between autoCommit values and indexing ram buffer
On 6/19/2013 10:38 AM, Shawn Heisey wrote: Looking at the numDocs for each segment, here's what I think is happening: The autoCommit kicks in after the first 25000 docs (25002 to be precise), but the ram buffer isn't emptied. The next 3339 documents get indexed, at which point the ram buffer fills up, so it flushes another segment. Then it does another 21674 docs to approximately reach 25000 for autoCommit, which forces another segment flush, but without emptying the buffer. lather, rinse, repeat. I seem to be wrong about it being strictly related to ramBufferSizeMB. Today I bumped the buffer up to 256MB, restarted Solr, and started another full-import. If I were completely right about the buffer interaction, this should have resulted in a few somewhat equal sized segments being created before creating a small one. It didn't change anything - it's still two segments per autocommit, one of which is around 3000 docs and the other adds to that to make about 25000. There's still something weird going on, but now I know that I don't completely understand it. I hope someone can shed some light. Thanks, Shawn
Re: Update by query?
It has come up before as a nice feature to have, but isn't in Solr right now. I'd say go ahead and file a Jira for a new feature. -- Jack Krupansky -Original Message- From: Timothy Potter Sent: Wednesday, June 19, 2013 12:57 PM To: solr-user@lucene.apache.org Subject: Update by query? Quick check to see if Solr supports an update-by-query feature or if anyone has thought about something like this ... similar to delete-by-query My specific use case is a metadata field needs to be updated for N docs where N 1 and the set can easily be identified by a query. Currently, I have to pull them all back and update, which works but is concerning when N is very large. I checked JIRA and didn't see mention of this but might have missed it. Cheers, Tim
Wildcards and Phrase queries
Hi, I'm trying to understand the status of enabling wildcards in phrase queries. Lucene JIRA issue: https://issues.apache.org/jira/browse/LUCENE-1486 Solr JIRA issue: https://issues.apache.org/jira/browse/SOLR-1604 It looks like these issues are not going to be resolved in the near future :( Will they? Have they hit a (partial) dead end with the current approach? Can I contribute anything to get them fixed into an official release? Are the latest patches attached to the JIRAs production-ready? [Should this message be sent to the java-user list?]
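In case it helps while the Solr-level issue stays open: the Lucene parser behind SOLR-1604, ComplexPhraseQueryParser in the queryparser module, can be used directly if you are willing to do your own query parsing. A minimal sketch, assuming Lucene 4.x and a field named "title" (both assumptions, not from the original message):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class WildcardPhraseExample {
        public static void main(String[] args) throws Exception {
            // ComplexPhraseQueryParser accepts wildcards, prefixes and ranges inside quoted phrases.
            ComplexPhraseQueryParser parser = new ComplexPhraseQueryParser(
                Version.LUCENE_43, "title", new StandardAnalyzer(Version.LUCENE_43));
            Query q = parser.parse("\"john smi*\"");   // matches e.g. "john smith", "john smits"
            System.out.println(q);
        }
    }

Wiring it into Solr would still mean writing a small QParserPlugin around it, which is roughly what the SOLR-1604 patches do.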
RE: yet another optimize question
Hi Walter, I used to have larger settings on our caches but it seemed like I had to make the caches that small to reduce memory usage to keep from getting the dreaded OOM exceptions. Also our search is behind Akamai with a one hour TTL. Our slave farm has a load balancer in front of twelve slave servers and our index is being updated constantly, pretty much 24/7. So my question would be how do you run with such big caches without going into the OOM zone? Was the Netflix index only updated based upon the release schedules of the studios, like once a week? Our entertainment stores used to be like that before we turned into a marketplace based e-tailer, but now we get new listings from merchants all the time and so have a constant churn of additions and deletions in our index. I feel like at 32GB our heap is really huge, but we seem to use almost all of it with these settings. I am trying out the G1GC on one slave to see if that gets memory usage lower but while it has a different collection pattern in the various spaces it seems like the total memory usage peaks out at about the same level. Thanks Robi -Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: Tuesday, June 18, 2013 6:57 PM To: solr-user@lucene.apache.org Subject: Re: yet another optimize question Your query cache is far too small. Most of the default caches are too small. We run with 10K entries and get a hit rate around 0.30 across four servers. This rate goes up with more queries, down with less, but try a bigger cache, especially if you are updating the index infrequently, like once per day. At Netflix, we had a 0.12 hit rate on the query cache, even with an HTTP cache in front of it. The HTTP cache had an 80% hit rate. I'd increase your document cache, too. I usually see about 0.75 or better on that. wunder On Jun 18, 2013, at 10:22 AM, Petersen, Robert wrote: Hi Otis, Yes the query results cache is just about worthless. I guess we have too diverse of a set of user queries. The business unit has decided to let bots crawl our search pages too so that doesn't help either. I turned it way down but decided to keep it because my understanding was that it would still help for users going from page 1 to page 2 in a search. Is that true? Thanks Robi -Original Message- From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] Sent: Monday, June 17, 2013 6:39 PM To: solr-user@lucene.apache.org Subject: Re: yet another optimize question Hi Robi, This goes against the original problem of getting OOMEs, but it looks like each of your Solr caches could be a little bigger if you want to eliminate evictions, with the query results one possibly not being worth keeping if you can't get the hit % up enough. Otis -- Solr ElasticSearch Support -- http://sematext.com/ On Mon, Jun 17, 2013 at 2:21 PM, Petersen, Robert robert.peter...@mail.rakuten.com wrote: Hi Otis, Right I didn't restart the JVMs except on the one slave where I was experimenting with using G1GC on the 1.7.0_21 JRE. Also some time ago I made all our caches small enough to keep us from getting OOMs while still having a good hit rate.Our index has about 50 fields which are mostly int IDs and there are some dynamic fields also. These dynamic fields can be used for custom faceting. We have some standard facets we always facet on and other dynamic facets which are only used if the query is filtering on a particular category. 
There are hundreds of these fields but since they are only for a small subset of the overall index they are very sparsely populated with regard to the overall index. With CMS GC we get a sawtooth on the old generation (I guess every replication and commit causes its usage to drop down to 10GB or so) and it seems to be the old generation which is the main space consumer. With the G1GC, the memory map looked totally different! I was a little lost looking at memory consumption with that GC. Maybe I'll try it again now that the index is a bit smaller than it was last time I tried it. After four days without running an optimize now it is 21GB. BTW our indexing speed is mostly bound by the DB so reducing the segments might be ok... Here is a quick snapshot of one slave's memory map as reported by PSI-Probe, but unfortunately I guess I can't send the history graphics to the solr-user list to show their changes over time:

    Name                 Used       Committed   Max         Initial     Group
    Par Survivor Space   20.02 MB   108.13 MB   108.13 MB   108.13 MB   HEAP
    CMS Perm Gen         42.29 MB   70.66 MB    82.00 MB    20.75 MB    NON_HEAP
    Code Cache           9.73 MB    9.88 MB     48.00 MB    2.44 MB     NON_HEAP
    CMS Old Gen          20.22 GB   30.94 GB    30.94 GB    30.94 GB
RE: TieredMergePolicy reclaimDeletesWeight
OK thanks, will do. Just out of curiosity, what would having that set way too high do? Would the index become fragmented or what? -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Wednesday, June 19, 2013 9:33 AM To: solr-user@lucene.apache.org Subject: Re: TieredMergePolicy reclaimDeletesWeight The default is 2.0, and higher values will more strongly favor merging segments with deletes. I think 20.0 is likely way too high ... maybe try 3-5? Mike McCandless http://blog.mikemccandless.com On Tue, Jun 18, 2013 at 6:46 PM, Petersen, Robert robert.peter...@mail.rakuten.com wrote: Hi In continuing a previous conversation, I am attempting to not have to do optimizes on our continuously updated index in solr3.6.1 and I came across the mention of the reclaimDeletesWeight setting in this blog: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html We do a *lot* of deletes in our index so I want to make the merges be more aggressive on reclaiming deletes, but I am having trouble finding much out about this setting. Does anyone have experience with this setting? Would the below accomplish what I want ie for it to go after deletes more aggressively than normal? I got the impression 10.0 was the default from looking at this code but I could be wrong: https://builds.apache.org/job/Lucene-Solr-Clover-trunk/lastSuccessfulBuild/clover-report/org/apache/lucene/index/TieredMergePolicy.html?id=3085

    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnce">20</int>
      <int name="segmentsPerTier">8</int>
      <double name="reclaimDeletesWeight">20.0</double>
    </mergePolicy>

Thanks Robert (Robi) Petersen Senior Software Engineer Search Department
Sharding and Replication clarification
Hi, I had questions on the implementation of the sharding and replication features of Solr/SolrCloud. 1. I noticed that when sharding is enabled for a collection, individual requests are sent to each node serving as a shard. 2. Replication follows the same strategy, sending individual documents to the nodes serving as replicas. I am working with a system that requires a massive number of writes, and I have noticed that for the above reason the cloud eventually starts to fail (even though I am using an ensemble). I do understand the reason behind individual updates, but why not batch them up, or give an option to batch N updates, in either of the above cases? I did come across a presentation that talked about batching 10 updates for replication at least, but I do not think this is the case. - Asif
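On the client side you can at least batch your own adds so each HTTP request carries many documents; whether that relieves the failures seen under this write load is a separate question, since intra-cluster forwarding is still per-document. A small SolrJ sketch (the ZooKeeper hosts, collection name, field names and batch size are illustrative):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchedIndexer {
        private static final int BATCH_SIZE = 1000;   // tune for document size

        public static void main(String[] args) throws Exception {
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("collection1");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(BATCH_SIZE);
            for (int i = 0; i < 100000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("title_t", "document " + i);
                batch.add(doc);
                if (batch.size() == BATCH_SIZE) {
                    server.add(batch);     // one request carries the whole batch
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                server.add(batch);
            }
            server.commit();
            server.shutdown();
        }
    }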
Re: TieredMergePolicy reclaimDeletesWeight
Way too high would cause it to pick highly lopsided merges just because a few deletes were removed. Highly lopsided merges (e.g. one big segment and N tiny segments) can be horrible because it can lead to O(N^2) merge cost over time. Mike McCandless http://blog.mikemccandless.com On Wed, Jun 19, 2013 at 1:36 PM, Petersen, Robert robert.peter...@mail.rakuten.com wrote: OK thanks, will do. Just out of curiosity, what would having that set way too high do? Would the index become fragmented or what? -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Wednesday, June 19, 2013 9:33 AM To: solr-user@lucene.apache.org Subject: Re: TieredMergePolicy reclaimDeletesWeight The default is 2.0, and higher values will more strongly favor merging segments with deletes. I think 20.0 is likely way too high ... maybe try 3-5? Mike McCandless http://blog.mikemccandless.com On Tue, Jun 18, 2013 at 6:46 PM, Petersen, Robert robert.peter...@mail.rakuten.com wrote: Hi In continuing a previous conversation, I am attempting to not have to do optimizes on our continuously updated index in solr3.6.1 and I came across the mention of the reclaimDeletesWeight setting in this blog: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-mer ges.html We do a *lot* of deletes in our index so I want to make the merges be more aggressive on reclaiming deletes, but I am having trouble finding much out about this setting. Does anyone have experience with this setting? Would the below accomplish what I want ie for it to go after deletes more aggressively than normal? I got the impression 10.0 was the default from looking at this code but I could be wrong: https://builds.apache.org/job/Lucene-Solr-Clover-trunk/lastSuccessfulB uild/clover-report/org/apache/lucene/index/TieredMergePolicy.html?id=3 085 mergePolicy class=org.apache.lucene.index.TieredMergePolicy int name=maxMergeAtOnce20/int int name=segmentsPerTier8/int double name=reclaimDeletesWeight20.0/double /mergePolicy Thanks Robert (Robi) Petersen Senior Software Engineer Search Department
RE: TieredMergePolicy reclaimDeletesWeight
Oh! Thanks for the info. I'll change that right away. -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Wednesday, June 19, 2013 10:42 AM To: solr-user@lucene.apache.org Subject: Re: TieredMergePolicy reclaimDeletesWeight Way too high would cause it to pick highly lopsided merges just because a few deletes were removed. Highly lopsided merges (e.g. one big segment and N tiny segments) can be horrible because it can lead to O(N^2) merge cost over time. Mike McCandless http://blog.mikemccandless.com On Wed, Jun 19, 2013 at 1:36 PM, Petersen, Robert robert.peter...@mail.rakuten.com wrote: OK thanks, will do. Just out of curiosity, what would having that set way too high do? Would the index become fragmented or what? -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Wednesday, June 19, 2013 9:33 AM To: solr-user@lucene.apache.org Subject: Re: TieredMergePolicy reclaimDeletesWeight The default is 2.0, and higher values will more strongly favor merging segments with deletes. I think 20.0 is likely way too high ... maybe try 3-5? Mike McCandless http://blog.mikemccandless.com On Tue, Jun 18, 2013 at 6:46 PM, Petersen, Robert robert.peter...@mail.rakuten.com wrote: Hi In continuing a previous conversation, I am attempting to not have to do optimizes on our continuously updated index in solr3.6.1 and I came across the mention of the reclaimDeletesWeight setting in this blog: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-me r ges.html We do a *lot* of deletes in our index so I want to make the merges be more aggressive on reclaiming deletes, but I am having trouble finding much out about this setting. Does anyone have experience with this setting? Would the below accomplish what I want ie for it to go after deletes more aggressively than normal? I got the impression 10.0 was the default from looking at this code but I could be wrong: https://builds.apache.org/job/Lucene-Solr-Clover-trunk/lastSuccessful B uild/clover-report/org/apache/lucene/index/TieredMergePolicy.html?id= 3 085 mergePolicy class=org.apache.lucene.index.TieredMergePolicy int name=maxMergeAtOnce20/int int name=segmentsPerTier8/int double name=reclaimDeletesWeight20.0/double /mergePolicy Thanks Robert (Robi) Petersen Senior Software Engineer Search Department
Re: yet another optimize question
I generally run with an 8GB heap for a system that does no faceting. 32GB does seem rather large, but you really should have room for bigger caches. The Akamai cache will reduce your hit rate a lot. That is OK, because users are getting faster responses than they would from Solr. A 5% hit rate may be OK since you have that front end HTTP cache. The Netflix index was updated daily. wunder On Jun 19, 2013, at 10:36 AM, Petersen, Robert wrote: Hi Walter, I used to have larger settings on our caches but it seemed like I had to make the caches that small to reduce memory usage to keep from getting the dreaded OOM exceptions. Also our search is behind Akamai with a one hour TTL. Our slave farm has a load balancer in front of twelve slave servers and our index is being updated constantly, pretty much 24/7. So my question would be how do you run with such big caches without going into the OOM zone? Was the Netflix index only updated based upon the release schedules of the studios, like once a week? Our entertainment stores used to be like that before we turned into a marketplace based e-tailer, but now we get new listings from merchants all the time and so have a constant churn of additions and deletions in our index. I feel like at 32GB our heap is really huge, but we seem to use almost all of it with these settings. I am trying out the G1GC on one slave to see if that gets memory usage lower but while it has a different collection pattern in the various spaces it seems like the total memory usage peaks out at about the same level. Thanks Robi -Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: Tuesday, June 18, 2013 6:57 PM To: solr-user@lucene.apache.org Subject: Re: yet another optimize question Your query cache is far too small. Most of the default caches are too small. We run with 10K entries and get a hit rate around 0.30 across four servers. This rate goes up with more queries, down with less, but try a bigger cache, especially if you are updating the index infrequently, like once per day. At Netflix, we had a 0.12 hit rate on the query cache, even with an HTTP cache in front of it. The HTTP cache had an 80% hit rate. I'd increase your document cache, too. I usually see about 0.75 or better on that. wunder On Jun 18, 2013, at 10:22 AM, Petersen, Robert wrote: Hi Otis, Yes the query results cache is just about worthless. I guess we have too diverse of a set of user queries. The business unit has decided to let bots crawl our search pages too so that doesn't help either. I turned it way down but decided to keep it because my understanding was that it would still help for users going from page 1 to page 2 in a search. Is that true? Thanks Robi -Original Message- From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] Sent: Monday, June 17, 2013 6:39 PM To: solr-user@lucene.apache.org Subject: Re: yet another optimize question Hi Robi, This goes against the original problem of getting OOMEs, but it looks like each of your Solr caches could be a little bigger if you want to eliminate evictions, with the query results one possibly not being worth keeping if you can't get the hit % up enough. Otis -- Solr ElasticSearch Support -- http://sematext.com/ On Mon, Jun 17, 2013 at 2:21 PM, Petersen, Robert robert.peter...@mail.rakuten.com wrote: Hi Otis, Right I didn't restart the JVMs except on the one slave where I was experimenting with using G1GC on the 1.7.0_21 JRE. 
Also some time ago I made all our caches small enough to keep us from getting OOMs while still having a good hit rate.Our index has about 50 fields which are mostly int IDs and there are some dynamic fields also. These dynamic fields can be used for custom faceting. We have some standard facets we always facet on and other dynamic facets which are only used if the query is filtering on a particular category. There are hundreds of these fields but since they are only for a small subset of the overall index they are very sparsely populated with regard to the overall index. With CMS GC we get a sawtooth on the old generation (I guess every replication and commit causes it's usage to drop down to 10GB or so) and it seems to be the old generation which is the main space consumer. With the G1GC, the memory map looked totally different! I was a little lost looking at memory consumption with that GC. Maybe I'll try it again now that the index is a bit smaller than it was last time I tried it. After four days without running an optimize now it is 21GB. BTW our indexing speed is mostly bound by the DB so reducing the segments might be ok... Here is a quick snapshot of one slaves memory map as reported by PSI-Probe, but unfortunately I guess I can't send the history graphics to the solr-user list to show their
RE: yet another optimize question
We actually have hundreds of facet-able fields, but most are specialized and are only faceted upon if the user has drilled into the particular category to which they are applicable and so they are only indexed for products in those categories. I guess it is the facets that eat up so much of our memory. It was suggested that if I use facet method = enum for those particular specialized facets then my memory usage would go down. I'm going to try that out and see how much it helps. Thanks Robi -Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: Wednesday, June 19, 2013 10:50 AM To: solr-user@lucene.apache.org Subject: Re: yet another optimize question I generally run with an 8GB heap for a system that does no faceting. 32GB does seem rather large, but you really should have room for bigger caches. The Akamai cache will reduce your hit rate a lot. That is OK, because users are getting faster responses than they would from Solr. A 5% hit rate may be OK since you have that front end HTTP cache. The Netflix index was updated daily. wunder On Jun 19, 2013, at 10:36 AM, Petersen, Robert wrote: Hi Walter, I used to have larger settings on our caches but it seemed like I had to make the caches that small to reduce memory usage to keep from getting the dreaded OOM exceptions. Also our search is behind Akamai with a one hour TTL. Our slave farm has a load balancer in front of twelve slave servers and our index is being updated constantly, pretty much 24/7. So my question would be how do you run with such big caches without going into the OOM zone? Was the Netflix index only updated based upon the release schedules of the studios, like once a week? Our entertainment stores used to be like that before we turned into a marketplace based e-tailer, but now we get new listings from merchants all the time and so have a constant churn of additions and deletions in our index. I feel like at 32GB our heap is really huge, but we seem to use almost all of it with these settings. I am trying out the G1GC on one slave to see if that gets memory usage lower but while it has a different collection pattern in the various spaces it seems like the total memory usage peaks out at about the same level. Thanks Robi -Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: Tuesday, June 18, 2013 6:57 PM To: solr-user@lucene.apache.org Subject: Re: yet another optimize question Your query cache is far too small. Most of the default caches are too small. We run with 10K entries and get a hit rate around 0.30 across four servers. This rate goes up with more queries, down with less, but try a bigger cache, especially if you are updating the index infrequently, like once per day. At Netflix, we had a 0.12 hit rate on the query cache, even with an HTTP cache in front of it. The HTTP cache had an 80% hit rate. I'd increase your document cache, too. I usually see about 0.75 or better on that. wunder On Jun 18, 2013, at 10:22 AM, Petersen, Robert wrote: Hi Otis, Yes the query results cache is just about worthless. I guess we have too diverse of a set of user queries. The business unit has decided to let bots crawl our search pages too so that doesn't help either. I turned it way down but decided to keep it because my understanding was that it would still help for users going from page 1 to page 2 in a search. Is that true? 
Thanks Robi -Original Message- From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] Sent: Monday, June 17, 2013 6:39 PM To: solr-user@lucene.apache.org Subject: Re: yet another optimize question Hi Robi, This goes against the original problem of getting OOMEs, but it looks like each of your Solr caches could be a little bigger if you want to eliminate evictions, with the query results one possibly not being worth keeping if you can't get the hit % up enough. Otis -- Solr ElasticSearch Support -- http://sematext.com/ On Mon, Jun 17, 2013 at 2:21 PM, Petersen, Robert robert.peter...@mail.rakuten.com wrote: Hi Otis, Right I didn't restart the JVMs except on the one slave where I was experimenting with using G1GC on the 1.7.0_21 JRE. Also some time ago I made all our caches small enough to keep us from getting OOMs while still having a good hit rate.Our index has about 50 fields which are mostly int IDs and there are some dynamic fields also. These dynamic fields can be used for custom faceting. We have some standard facets we always facet on and other dynamic facets which are only used if the query is filtering on a particular category. There are hundreds of these fields but since they are only for a small subset of the overall index they are very sparsely populated with regard to the overall index. With CMS GC we get a sawtooth on the old generation (I guess every
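As a side note on the facet.method=enum idea above: facet.method can be set per field with the f.<field>.facet.method= form, so only the sparse category-specific facets need to switch to enum while the dense, always-used facets stay on fc. A hedged SolrJ sketch (field names are made up, not from the messages above):

    import org.apache.solr.client.solrj.SolrQuery;

    public class FacetMethodExample {
        public static SolrQuery build() {
            SolrQuery q = new SolrQuery("*:*");
            q.setFacet(true);
            q.addFacetField("brand_s", "color_attr_s");      // illustrative facet fields
            // Dense facet: fc un-inverts the field into memory, fast but memory-hungry.
            q.set("f.brand_s.facet.method", "fc");
            // Sparse dynamic facet: enum walks the terms and leans on the filterCache instead.
            q.set("f.color_attr_s.facet.method", "enum");
            return q;
        }
    }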
solr spatial search with distance to search results
I was reading this: http://wiki.apache.org/solr/SpatialSearch I have this Solr query: http://localhost:8983/solr/tt/select/?indent=on&facet=true&fq={!geofilt}&pt=51.4416420,5.4697225&sfield=geolocation&d=20&sort=geodist()%20asc&q=*:*&start=0&rows=10&fl=_dist_:geodist(),id,title,lat,lng,location&facet.mincount=1 And this in my schema.xml:

    <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
    <field name="geolocation" type="location" indexed="true" stored="true"/>
    <dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>

However, with my current query string, I don't see a distance field in the document, and also no location field. What am I missing? -- View this message in context: http://lucene.472066.n3.nabble.com/solr-spatial-search-with-distance-to-search-results-tp4071745.html Sent from the Solr - User mailing list archive at Nabble.com.
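Two hedged observations: the _dist_:geodist() pseudo-field in fl only works on Solr 4.0 or later, and the fl above asks for lat, lng and location, none of which are field names in the schema snippet (the stored field is geolocation). A SolrJ sketch of the same query for comparison, with those two points applied (the core name and the assumption that you are on 4.x are mine):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class GeoDistExample {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/tt");
            SolrQuery q = new SolrQuery("*:*");
            q.addFilterQuery("{!geofilt}");
            q.set("pt", "51.4416420,5.4697225");
            q.set("sfield", "geolocation");
            q.set("d", "20");
            q.set("sort", "geodist() asc");
            // Ask for the stored location field plus a distance pseudo-field (Solr 4.0+ only).
            q.setFields("id", "title", "geolocation", "_dist_:geodist()");
            QueryResponse rsp = server.query(q);
            System.out.println(rsp.getResults());
        }
    }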
fq vs q parameter
Hi, I am currently using the below configuration in one of my handlers and I was thinking of removing the values from the q parameter and including them as part of the fq parameter. Can someone let me know if there is any performance improvement when using the fq parameter compared to q?

    <str name="q">( _query_:{!dismax qf=person_name_lname_i v=$fps_lname}^8.3 OR )</str>
    </lst>
    <lst name="appends">
      <str name="fq">{!switch case='*:*' default=$fq_bbox v=$fps_latlong}</str>
    </lst>
    <lst name="invariants">
      <str name="fq_bbox">_query_:{!bbox pt=$fps_latlong sfield=geo d=$fps_dist}^0.2</str>
    </lst>

-- View this message in context: http://lucene.472066.n3.nabble.com/fq-vs-q-parameter-tp4071748.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: fq vs q parameter
Yes, definitely, fq parameters don't affect scoring and can be cached. Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions w: appinions.com http://www.appinions.com/ On Wed, Jun 19, 2013 at 4:27 PM, Learner bbar...@gmail.com wrote: Hi, I am currently using the below configuration in one of my handler and I was thinking of removing the values from q parameter and including as a part of fq parameter. Can someone let me know if there is any performance improvement when using fq parameter compared to q? str name=q ( _query_:{!dismax qf=person_name_lname_i v=$fps_lname}^8.3 OR ) /str /lst lst name=appends str name=fq{!switch case='*:*' default=$fq_bbox v=$fps_latlong}/str /lst lst name=invariants str name=fq_bbox_query_:{!bbox pt=$fps_latlong sfield=geo d=$fps_dist}^0.2/str /lst -- View this message in context: http://lucene.472066.n3.nabble.com/fq-vs-q-parameter-tp4071748.html Sent from the Solr - User mailing list archive at Nabble.com.
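A small SolrJ illustration of the split Michael describes: the relevance clause stays in q, and the geo restriction moves to fq, where it is cached in the filterCache and contributes nothing to the score. This is only a sketch of the general idea; it assumes you no longer want the ^0.2 boost that the bbox clause carries in the config above, since a clause in fq cannot boost anything:

    import org.apache.solr.client.solrj.SolrQuery;

    public class QvsFqExample {
        public static SolrQuery build(String lastName, String latLong, String dist) {
            SolrQuery q = new SolrQuery();
            // Scoring part: stays in q so it contributes to relevance.
            q.setQuery("{!dismax qf=person_name_lname_i}" + lastName);
            // Pure restriction: bbox as a filter query, cached and score-neutral.
            q.addFilterQuery("{!bbox pt=" + latLong + " sfield=geo d=" + dist + "}");
            return q;
        }
    }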
Re: fq vs q parameter
I see that your query has a boost value, so this means you need Solr to score each matching document. One of the key differences between q and fq is that fq has no impact on the score, whereas having the clause in q will score each document based on the similarity score. -- View this message in context: http://lucene.472066.n3.nabble.com/fq-vs-q-parameter-tp4071748p4071758.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: fq vs q parameter
+1. Both q and fq can be cached (q results in the queryResultCache, fq in the filterCache). -- View this message in context: http://lucene.472066.n3.nabble.com/fq-vs-q-parameter-tp4071748p4071759.html Sent from the Solr - User mailing list archive at Nabble.com.
Informal poll on running Solr 4 on Java 7 with G1GC
I'm sure there's some site to do this but wanted to get a feel for who's running Solr 4 on Java 7 with G1 gc enabled? Cheers, Tim
Re: Adding documents in Solr plugin
I think this makes sense. Timothy asked about update by query in the last 24 hours and this sounds like the same thing. Otis -- Solr ElasticSearch Support http://sematext.com/ On Wed, Jun 19, 2013 at 3:52 AM, Avner Levy av...@checkpoint.com wrote: I have a core with millions of records. I want to add a custom handler which scan the existing documents and update one of the field (delete and add document) based on a condition (age12 for example). All fields are stored so there is no problem to recreate the document from the search result. I prefer doing it on the Solr server side for avoiding sending millions of documents to the client and back. I'm thinking of writing a solr plugin which will receive a query and update some fields on the query documents (like the delete by query handler). Are existing solutions or better alternatives? I couldn't find any examples of Solr plugins which update / add / delete documents (I don't need to extend the update handler). If someone has an example it will be great help. Thanks in advance
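Until something like a server-side update-by-query exists, Solr 4.x atomic updates at least keep the client-to-server traffic small: you still need the matching ids, but each update document carries only the id and the changed field. A SolrJ sketch; it assumes the core has <updateLog/> and a _version_ field configured, and the URL and field names are illustrative:

    import java.util.Collections;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class AtomicUpdateExample {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "12345");   // uniqueKey of the document to change
            // "set" replaces just this field; Solr rebuilds the rest from stored fields.
            doc.addField("age_group_s", Collections.singletonMap("set", "teen"));
            server.add(doc);
            server.commit();
        }
    }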
Re: Merge tool based on mergefactor
Hi, On Wed, Jun 19, 2013 at 3:52 AM, Cosimo Streppone cos...@streppone.it wrote: On 06/19/2013 03:21 AM, Otis Gospodnetic wrote: You could call the optimize command directly on slaves, but specify the target number of segments, e.g. /solr/update?optimize=truemaxSegments=10 Not sure I recommend doing this on slaves, but you could - maybe you have spare capacity. You may also want to consider not doing it on all your slaves at the same time... IIUC this assumes your slaves do not replicate too often, otherwise replication would reset the index to whatever number of segments the master has. You could still perform an optimize with maxSegments after every replication, if it's acceptable in the situation you are in. However, if you need slaves to update every 2-5 minutes, that would be impractical and wasteful. Is this correct? Correct. If so, how to find a fair compromise/balance between master and slave merge factors if you need very frequent indexing of new documents (say continuous) on the master and up-to-date indexes on the slaves (say 2-5' pollInterval)? If you need that you go to SolrCloud and start using softCommits. Otis -- Solr ElasticSearch Support http://sematext.com/
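For what it's worth, soft commits can also be issued from the client rather than only via autoSoftCommit in solrconfig.xml; SolrJ exposes the softCommit flag on commit(). A sketch, assuming Solr 4.x with an updateLog configured (the URL is illustrative):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class SoftCommitExample {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
            // waitFlush=true, waitSearcher=true, softCommit=true:
            // makes recent documents visible without fsyncing new segments to disk.
            server.commit(true, true, true);
        }
    }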
update solr.xml dynamically to add new cores
Hi, Is there a way to edit solr.xml as part of a Debian package installation to add new cores? In my use case, there are 4 Solr indexes and they are managed/configured by different teams. The way I am thinking the packages will work is described below. 1. There will be a solr-base debian package which comes with the Solr installation and tomcat setup (I am planning to use Solr 4.3). 2. There will be individual index debian packages like solr-index1, solr-index2, which will be dependent on solr-base. Each package's DEBIAN postinst script will have logic to edit solr.xml to add a new index like index1, index2, etc. Does this sound good? Or is there a better/different way to do this? Any pointers will be much appreciated. Thanks, -M -- View this message in context: http://lucene.472066.n3.nabble.com/update-solr-xml-dynamically-to-add-new-cores-tp4071800.html Sent from the Solr - User mailing list archive at Nabble.com.
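One alternative worth considering, rather than having each package's postinst rewrite solr.xml by hand: with persistent="true" in solr.xml, a CoreAdmin CREATE call makes Solr add the core entry and persist the file itself. A SolrJ sketch of what a postinst-driven helper could call (paths, port and core names are illustrative assumptions):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;

    public class AddCoreExample {
        public static void main(String[] args) throws Exception {
            // Points at the Solr webapp root (CoreAdmin handler), not an individual core.
            HttpSolrServer adminServer = new HttpSolrServer("http://localhost:8080/solr");
            // Creates "index1" from an instanceDir laid down by the solr-index1 package.
            // With persistent="true" in solr.xml, Solr rewrites solr.xml on its own.
            CoreAdminRequest.createCore("index1", "/opt/solr/cores/index1", adminServer);
        }
    }

The same call is available over plain HTTP as /solr/admin/cores?action=CREATE, which may be easier to invoke from a shell-based postinst.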
Partial update using solr 4.3 with csv input
I was going through this link http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/ and one of the comments is about support for CSV. Since the comment is almost a year old, I'm just wondering if it is still true that partial updates are possible only with XML and JSON input? Thanks, -M -- View this message in context: http://lucene.472066.n3.nabble.com/Partial-update-using-solr-4-3-with-csv-input-tp4071801.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Partial update using solr 4.3 with csv input
Correct, no atomic update for CSV format. There just isn't any place to put the atomic update options in such a simple text format. -- Jack Krupansky -Original Message- From: smanad Sent: Wednesday, June 19, 2013 8:30 PM To: solr-user@lucene.apache.org Subject: Partial update using solr 4.3 with csv input I was going through this link http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/ and one of the comments is about support for csv. Since the comment is almost a year old, just wondering if this is still true that, partial updates are possible only with xml and json input? Thanks, -M -- View this message in context: http://lucene.472066.n3.nabble.com/Partial-update-using-solr-4-3-with-csv-input-tp4071801.html Sent from the Solr - User mailing list archive at Nabble.com.
SolrCloud - Score calculation
Hi, Sorry if this is a very basic question, but I am pretty new to SolrCloud and I am trying to understand the underlying mechanism for calculating relevancy. Currently we are using Solr 3.6.x and we use shards to perform distributed searching. Our shards are not of equal size, hence sometimes the results are not as we expect. For example: shard 1 has 30 million documents, shard 2 has 30 million documents and shard 3 has just 3 million documents (push indexing via a message queue). When we do a search using shards, documents from shard 1 and shard 2 get higher priority compared to documents in shard 3 (since it is smaller). Currently we add an index-time boost when adding documents to shard 3 so that the documents from shard 3 also come up (higher) in search results. Now when using SolrCloud, say for example one shard has a person name repeated 5 times (with different unique ids) and we have one more occurrence of the same person name in shard 2 (with a different id). When we do a search, how does Solr calculate the score? Does it do something like constant scoring across the various shards in order to bring up the search results across shards? How does the score get calculated? Do all 6 documents have the same score (5 from shard 1 and 1 from shard 2), if all the fields have the same value except for the unique id? Thanks, BB -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-Score-calculation-tp4071805.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud - Score calculation
The reason for the issue you are seeing is the IDF component of the score. IDF = inverse document frequency. The document frequency is the number of documents in the index in which a term appears. The higher the document frequency, the more common the term and thus the less relevant it is. The document frequency is inverted to give a higher number for more relevant terms. Solr does not yet support distributed IDF. Therefore the document frequency in a 3m shard will be higher (as a proportion of that shard's index) compared to your 30m shard, thus it will score lower. I am not aware of a multiplier you can use to fix this. There is a distributed IDF ticket in JIRA, maybe that is mature enough and might help you. Upayavira On Thu, Jun 20, 2013, at 01:56 AM, Learner wrote: Hi, Sorry if its a very basic question but I am pretty new to SolrCloud and I am trying to understand the underlying mechanism for calculating relevancy. Currently we are using SOLR 3.6.X and we use shards to perform distributed searching. Our shards are not of equal size hence sometimes the results are not as we expected. For ex: Shard 1 has 30 million documents, Shard 2 has 30 millon documents and shard 3 has just 3 million documents (push indexing via message queue). When we do a search using shards, documents from shard 1 and shard 2 gets higher priority compared to documents in shard 3 (since its smaller). Currently we add index time boost when adding documents to shard 3 so that the documents from shard 3 also comes up (higher) in search results. Now when using SolrCloud, say for example if one shard has person name repeated 5 times (with different unique id) and we have one more same person name in shard 2 (with diff id), and when we do a search how does SOLR calculate the score? Does it do something like constant scoring across various shards in order to bring up the search results across various shards? How does the score gets calculated.. Does the score of all 6 documents have same value(5 from shard 1 and 1 from shard 2 -if all the fields have same value except for unique id)? Thanks, BB -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-Score-calculation-tp4071805.html Sent from the Solr - User mailing list archive at Nabble.com.
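To make the skew concrete, here is Lucene's default idf formula (DefaultSimilarity: idf = 1 + ln(numDocs / (docFreq + 1))) evaluated per shard, since each shard uses only its own local statistics. The absolute numbers are made up for illustration, loosely following the example in the question:

    public class ShardIdfExample {
        // Lucene DefaultSimilarity idf: 1 + ln(numDocs / (docFreq + 1))
        static double idf(long docFreq, long numDocs) {
            return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
        }

        public static void main(String[] args) {
            // The name occurs 5 times in a 30M-doc shard and once in a 3M-doc shard.
            // Each shard computes idf from its own stats, so otherwise identical
            // documents end up with different scores across shards.
            System.out.println("30M-doc shard idf = " + idf(5, 30000000L)); // ~16.4
            System.out.println("3M-doc shard  idf = " + idf(1, 3000000L));  // ~15.2
        }
    }

With distributed (global) IDF, both shards would use the combined docFreq and numDocs and the two idf values would be identical.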
Re: Informal poll on running Solr 4 on Java 7 with G1GC
On 6/19/2013 4:18 PM, Timothy Potter wrote: I'm sure there's some site to do this but wanted to get a feel for who's running Solr 4 on Java 7 with G1 gc enabled? I have tried it, but found that G1 didn't give me any better GC pause characteristics than CMS without tuning, and may have actually been worse. Now I use CMS with several tuning options. Thanks, Shawn
Re: PostingsSolrHighlighter not working on Multivalue field
Hi Erick, multivalue was my typo, thanks for the reminder. There is no log showing anything wrong and no exception occurred. The field definitions are as follows:

    <field name="summary" type="text" indexed="true" stored="true" omitNorms="false" termVectors="true" termPositions="true" termOffsets="true" storeOffsetsWithPositions="true"/>
    <dynamicField name="*" type="text" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" omitNorms="false" storeOffsetsWithPositions="true"/>

The PostingsSolrHighlighter only highlights the summary field. When I send an xml file to Solr like this:

    <?xml version="1.0" encoding="utf-8"?>
    <command>
      <add>
        <doc>
          <field name="summary">facebook yahoo plurk twitter social nextworing</field>
          <field name="body_0">facebook yahoo plurk twitter social nextworing</field>
        </doc>
      </add>
    </command>

As you can see, body_0 will be handled by the dynamicField definition. Part of the debug response returned by Solr looks like this:

    <lst name="highlighting">
      <lst name="645">
        <arr name="summary">
          <str><em>Facebook</em>... <em>Facebook</em></str>
        </arr>
        <arr name="body_0"/>
      </lst>
    </lst>

I'm sure hl.fl contains both summary and body_0. This behavior differs between PostingsSolrHighlighter and FastVectorHighlighter. Please kindly help with this. Many thanks. Floyd 2013/6/19 Erick Erickson erickerick...@gmail.com Well, _how_ does it fail? Unless it's a typo it should be multiValued (not capital 'V'). This probably isn't the problem, but just in case. Anything in the logs? What is the field definition? Did you re-index after changing to multiValued? Best Erick On Tue, Jun 18, 2013 at 11:01 PM, Floyd Wu floyd...@gmail.com wrote: In my test case, it seems this new highlighter is not working. When a field is set multivalue=true, the stored text in this field cannot be highlighted. Am I missing something? Or is this a current limitation? I've had no luck finding any documentation mentioning this. Floyd
RE: Solr 4.2 in SolrCloud mode lost response for update but search is normal
From the core dump information, it seems that the issue is the same as this JIRA: https://issues.apache.org/jira/browse/SOLR-4400: Rapidly opening and closing cores can lead to deadlock. Mark Miller: does the issue happen again? Thanks. From: Qun Wang Sent: June 20, 2013 11:24 To: solr-user@lucene.apache.org Subject: Solr 4.2 in SolrCloud mode lost response for update but search is normal Hi all: I'm using SolrCloud with Solr 4.2, and a strange issue currently happens often and confuses me. The running environment has three ZooKeepers and two Solrs, using the same shard and six cores. Without any network or resource issue, I found that after a while Solr would stop responding to updates, while queries stay normal. The servers neither process any update nor generate response info; it seems something has entered a deadlock. Thread dump info for one machine is attached. Could someone help me check what the root issue is? Thanks! __ Qun Wang Application Service - Backend Morningstar (Shenzhen) Ltd. Morningstar. Illuminating investing worldwide. +86 755 3311 0218 Office +86 186 6538 3975 Mobile qun.w...@morningstar.com This e-mail contains privileged and confidential information and is intended only for the use of the person(s) named above. Any dissemination, distribution or duplication of this communication without prior written consent from Morningstar is strictly prohibited. If you received this message in error please contact the sender immediately and delete the materials from any computer.