SolrCloud Architecture recommendations + related questions
Hi All,

TL;DR version: We think we want to explore Lucene/Solr 4.0 and SolrCloud, but I'm not sure if there are any good docs/articles on how to make architecture choices for chopping up big indexes... and what other general considerations are part of the equation?

I'm throwing this post out to the public to see if any kind and knowledgeable individuals could provide some educated feedback on the options our team is currently considering for the future architecture of our Solr indexes.

We have a loose collection of Solr indexes, each with a specific purpose and differing schemas and document makeup, containing just over 300 million documents with varying degrees of full text. Our existing architecture is showing its age, as it is really just the setup used for small/medium indexes scaled upwards. The biggest individual index is around 140 million documents and currently exists as a master/slave setup, with the master receiving all writes in the background and the 3 load-balanced slaves updating on a 5 minute poll interval. The master index is 451 GB on disk, and the 3 slaves are running JVMs with RAM allocations of 21 GB (right now, anyway).

We are struggling under the traffic load and/or scale of our indexes (mainly the latter, I think). We know this isn't the best way to run things, but the index in question is a fairly new addition, and each time we run into issues we tend to make small changes to improve things in the short term: bumping the RAM allocation up, toying with poll intervals, garbage collection config, etc.

We've historically run into issues with facet queries generating a lot of bloat on some types of fields. These had to be solved through internal modifications, but I expect we'll have to review this with the new version anyway. Related to that, there are some question marks over generating good facet data from a sharded approach.

In particular, though, we are really struggling with garbage collection on the slave machines around the time the slave/master sync occurs, because multiple copies of the index are held in memory until all searchers have de-referenced the old index. The machines typically either crash from OOM when we occasionally have a third and/or fourth copy of the index appear because really old searchers aren't 'letting go' (hence we play with widening poll intervals), or, more rarely, they become perpetually locked in GC and have to be restarted (not 100% sure why, but large heap allocations aren't helping, and cache warming may be a culprit).

The team has lots of things we want to try, but given the scale of the systems it is very hard to just try things out without considerable resourcing implications. The entire ecosystem is spread across 7 machines resourced in the 64 GB-100 GB RAM range (this is just me poking around our servers, not a thorough assessment). Each machine runs several JVMs, so that for each 'type' of index there are typically 2-4 load-balanced slaves available at any given time. One of those machines is used exclusively as the master for all indexes and receives no search traffic, just lots of write traffic.

I believe the answers to some of these questions are going to be very much dependent on schemas and documents, so I don't imagine anyone can answer them better than we can after testing and benchmarking... but right now we are still trying to choose where to start, so broad ideas are very welcome.
The kinds of things we are currently thinking about:

- Moving to v4.0 (we have just completed our v3.5 upgrade) to take advantage of the reduced RAM consumption: https://issues.apache.org/jira/browse/LUCENE-2380 We are hoping this has the double-whammy effect of improving garbage collection as well. Lots of full-text data should equal lots of Strings, and thus lots of savings from this change.

- Moving to a basic sharded approach (see the sketch after this list). We've only just started testing this, and I'm not involved, so I'm not sure what early results we've got.

- Given that we'd like to move to v4.0, I believe this opens up the option of a SolrCloud implementation. My suspicion is that this is where the money is at, but I'd be happy to hear feedback (good or bad) from people who are using it in production.

- Hardware: we are not certain that the current approach of a few colossal machines is any better than lots of smaller clustered machines, and it is prohibitively expensive to experiment here. We don't think our current setup using SSDs and fibre-channel connections is creating too many I/O bottlenecks, and we rarely see other hardware-related issues, but I'd again be curious whether people have observed contradictory evidence. My suspicion is that with the changes above, our current hardware would handle the load far better than it currently does.

- Are there any pros and cons documented out there for making decisions on sharding?
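Regarding the basic sharded approach, here is a minimal SolrJ sketch of a client-side distributed query using Solr's standard shards parameter. The host names and field are hypothetical, and HttpSolrServer is the SolrJ client as of 3.6/4.x; this is illustrative only, not a description of the poster's setup:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardedQueryExample {
    public static void main(String[] args) throws Exception {
        // Query any one node; it scatters the request across the listed
        // shards and merges the results.
        SolrServer server = new HttpSolrServer("http://shard1.example.com:8983/solr");

        SolrQuery query = new SolrQuery("full_text:example");
        // Standard distributed-search parameter: comma-separated shard
        // locations (hypothetical hosts, no http:// prefix).
        query.set("shards",
            "shard1.example.com:8983/solr,shard2.example.com:8983/solr");
        query.setRows(10);

        QueryResponse rsp = server.query(query);
        System.out.println("Total hits across shards: "
            + rsp.getResults().getNumFound());
    }
}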
error message in solr logs
Hi, we have a large Lucene index base created using Solr. It's split into 16 cores, and each core contains almost 10 GB of indexes. We have deployed 8 instances of Solr hosting two cores each. The logic for identifying where a document resides, based on the document id, is built into the application. There are also other queries which query all the cores across all Solr instances, because the query may not be based on document id. We use SolrJ to connect to and query the indexes and get results.

We have more reads than writes overall. A document is inserted once and updated at most twice within a few days, but it could potentially be searched tens of times in a day.

Lately we are noticing the exception below in our Solr logs. This happens once or twice a day on a few cores:

SEVERE: org.apache.solr.common.SolrException: Invalid chunk header
        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:72)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:852)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
        at java.lang.Thread.run(Thread.java:662)
Caused by: com.ctc.wstx.exc.WstxIOException: Invalid chunk header
        at com.ctc.wstx.stax.WstxInputFactory.doCreateSR(WstxInputFactory.java:548)
        at com.ctc.wstx.stax.WstxInputFactory.createSR(WstxInputFactory.java:604)
        at com.ctc.wstx.stax.WstxInputFactory.createSR(WstxInputFactory.java:660)
        at com.ctc.wstx.stax.WstxInputFactory.createXMLStreamReader(WstxInputFactory.java:331)
        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:68)
        ... 17 more
Caused by: java.io.IOException: Invalid chunk header
        at org.apache.coyote.http11.filters.ChunkedInputFilter.doRead(ChunkedInputFilter.java:133)
        at org.apache.coyote.http11.InternalInputBuffer.doRead(InternalInputBuffer.java:710)
        at org.apache.coyote.Request.doRead(Request.java:428)
        at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:304)
        at org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:405)
        at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:327)
        at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:193)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)

The environment consists of:
OS: Enterprise Linux, 64-bit
Tomcat version: 6.0.26
Solr version: 3.3.0
JDK: 1.6
Total number of Solr documents: more than 20 million

Can someone please let me know what this is, as googling around doesn't give me much info? Overall I don't see much of a problem in the application's use, but I wanted to know what this error is and what its impact on the app could be in the future. Thanks for any help in advance.
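Judging from the stack trace, the failure happens in Tomcat's ChunkedInputFilter while reading the HTTP request body: the client (or something between it and Tomcat) sent a malformed chunked-transfer request, so Solr rejects that one update, but the index itself should be unharmed. Since the error is intermittent, one defensive option on the client side is a small retry wrapper around SolrJ updates. This is a minimal sketch under those assumptions (the retry budget, URL, and core name are hypothetical, and it is Java 6-compatible to match the environment above), not a fix for the underlying corruption:

import java.io.IOException;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class RetryingAdd {
    // Retry a single add a few times before giving up, since the
    // "Invalid chunk header" failures appear to be transient.
    static void addWithRetry(SolrServer server, SolrInputDocument doc)
            throws SolrServerException, IOException {
        final int maxAttempts = 3; // hypothetical retry budget
        for (int attempt = 1; ; attempt++) {
            try {
                server.add(doc); // may fail mid-request with a transport error
                return;
            } catch (IOException e) {
                if (attempt >= maxAttempts) throw e;
            } catch (SolrServerException e) {
                if (attempt >= maxAttempts) throw e;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // CommonsHttpSolrServer is the SolrJ 3.x HTTP client.
        SolrServer server =
            new CommonsHttpSolrServer("http://localhost:8080/solr/core0");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        addWithRetry(server, doc);
        server.commit();
    }
}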
Re: Special suggestions requirement
Is there anything you cannot do with Solr? :-) Thanks a lot Erick! I only had to use . instead of ?, e.g.

...:8983/solr/terms?terms.fl=fieldname&terms.limit=100&terms.prefix=abcd&terms.regex.flag=case_insensitive&terms=true&terms.regex=abcd..

Adding terms.sort=index even allows me to sort as I need.

Thanks,
Alexander

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Saturday, 4 August 2012 20:11
To: solr-user@lucene.apache.org
Subject: Re: Special suggestions requirement

Would it work to use TermsComponent with wildcards? Something like terms.regex=ABCD42??... See: http://wiki.apache.org/solr/TermsComponent/

Best
Erick

On Fri, Aug 3, 2012 at 9:07 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

I could be crazy, but it sounds to me like you need a trie, not a search index: http://en.wikipedia.org/wiki/Trie But in any case, what you want to do should be achievable. It seems like you need to do EdgeNGrams and facet on the results, keeping facet counts > 1 to exclude the actual part numbers, since each of those would be distinct. I'm on the train right now, so I can't test this. :\

Michael Della Bitta
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn't a Game

On Thu, Aug 2, 2012 at 9:19 PM, Lochschmied, Alexander alexander.lochschm...@vishay.com wrote:

Even with a prefix query, I do not get ABCD02 or any ABCD02... back. BTW: EdgeNGramFilterFactory is used on the field we are getting the suggestions/spellchecks from. I think the problem is that there are a lot of different part numbers starting with ABCD, and every part number has the same length. I showed only 4 in the example, but there might be thousands. Here are some full part number examples that might be in the index:

ABCD110040
ABCD00
ABCD99
ABCD155500
...

I'm looking for a way to make Solr return a distinct list of fixed-length substrings of them, e.g. if ABCD is entered, I would need

ABCD00
ABCD01
ABCD02
ABCD03
...
ABCD99

Then if the user chose ABCD42 from the suggestions, I would need

ABCD4201
ABCD4202
ABCD4203
...
ABCD4299

and so on. I would be able to do some post-processing if needed, or adjust the schema or indexing process. But the key functionality I need from Solr is returning a distinct set of those suggestions where only the last two characters change. All of the available combinations of those last two characters must be considered, though. I need to show alphanumerically sorted suggestions, with the smallest value first.

Thanks,
Alexander

-----Original Message-----
From: Michael Della Bitta [mailto:michael.della.bi...@appinions.com]
Sent: Thursday, 2 August 2012 15:02
To: solr-user@lucene.apache.org
Subject: Re: Special suggestions requirement

In this case, we're storing the overall value length and sorting on that, then alphabetically. Also, how are your queries fashioned? If you're doing a prefix query, everything that matches it should score the same. If you're only doing a prefix query, you might need to add a term for exact matches as well to get them to show up.

Michael Della Bitta
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn't a Game

On Wed, Aug 1, 2012 at 9:58 PM, Lochschmied, Alexander alexander.lochschm...@vishay.com wrote:

Is there a way to offer distinct, alphabetically sorted, fixed-length options? I am trying to suggest part numbers, and I'm currently trying to do it with the spellchecker component.
Let's say ABCD was entered and we have indexed part numbers like

ABCD
ABCD2000
ABCD2100
ABCD2200
...

I would like to always have 2 characters suggested, so for ABCD it should suggest

ABCD00
ABCD20
ABCD21
ABCD22
...

No smart sorting is needed, just alphabetical sorting. The problem is that, for example, 00 (or ABCD00) may not be suggested currently because it doesn't score high enough. But we are really trying to get all distinct values starting from the smallest (up to a certain number of suggestions). I was already looking at the custom comparator class option, but this would probably not work, as I would need more information to implement it there (like at least the currently entered search term, ABCD in the example).

Thanks,
Alexander
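For reference, the TermsComponent request that solved this can also be issued from SolrJ by setting the terms.* parameters directly. A minimal sketch, reusing the placeholder field name fieldname from the URL above; the host is hypothetical, and setting qt to a path starting with "/" makes SolrJ send the request to that handler:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.TermsResponse;

public class PartNumberSuggest {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery();
        query.set("qt", "/terms");             // route to the TermsComponent handler
        query.set("terms", "true");
        query.set("terms.fl", "fieldname");
        query.set("terms.limit", "100");
        query.set("terms.regex", "ABCD42..");  // two wildcard chars after the chosen prefix
        query.set("terms.regex.flag", "case_insensitive");
        query.set("terms.sort", "index");      // alphabetical (index) order

        QueryResponse rsp = server.query(query);
        TermsResponse terms = rsp.getTermsResponse();
        for (TermsResponse.Term t : terms.getTerms("fieldname")) {
            System.out.println(t.getTerm());
        }
    }
}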
read write solr shard setup
Hi, I am trying to use a read/write Solr setup. What I mean is that I would have a common location for the Lucene indexes and configure one instance of Solr for reads and another instance that only writes new indexes, with both instances pointing to the same index location. The approach is described here: http://wiki.apache.org/solr/NearRealtimeSearchTuning

My question is: is there a way I can read the documents from the read-only instance without calling the empty commit()? I mean, is there some configuration I can change in solrconfig.xml or something? I have the following configuration in solrconfig.xml:

<autoCommit>
  <maxDocs>1</maxDocs>
  <maxTime>10</maxTime>
</autoCommit>

But this doesn't seem to help the RO node read the just-committed documents.
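For what it's worth, autoCommit in solrconfig.xml only governs the instance doing the writing; the read-only core keeps serving from its already-open searcher until something tells it to reopen, which is why the empty commit() is needed. If you can live with it, that workaround can at least be automated. A minimal SolrJ sketch (the URL and poll period are hypothetical, and CommonsHttpSolrServer is the SolrJ 3.x client):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class ReopenReadOnlySearcher {
    public static void main(String[] args) throws Exception {
        // Point at the read-only instance, not the writer (hypothetical URL).
        SolrServer readOnly = new CommonsHttpSolrServer("http://localhost:8983/solr");
        while (true) {
            // An empty commit writes nothing but forces the RO core to
            // open a new searcher on the shared index directory.
            readOnly.commit();
            Thread.sleep(10000); // poll every 10 seconds
        }
    }
}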
Returning page numbers where match occurs
Suppose we are provisioning search over large text documents (e.g., Word, PPT). It would be nice to have the highlighter component return the page numbers where the matches are found, so that they may be included in the search result summaries. What is the most efficient way to accomplish this?

Thanks,
Debdoot
Re: Returning page numbers where match occurs
There is an old, open Jira issue, SOLR-380 (There's no way to convert search results into page-level hits of a structured document), but no recent activity on it. It does have a lot of interesting commentary, though. I wouldn't get my hopes up. See: https://issues.apache.org/jira/browse/SOLR-380

The short answer is that you would have to re-parse the document yourself, since Tika/POI called from SolrCell simply parses the document into a linear, unstructured stream of text with no markers for pages. The SOLR-380 Jira issue may give you some clues.

I do have a related question: would you want strictly integer page numbers, where the first page of any front matter is 1, or the actual literal page numbers (e.g. iii or A-1)? The former is simpler, but incorrect if the user thinks they can simply look for that page number in the document.

-- Jack Krupansky

-----Original Message-----
From: debdoot
Sent: Monday, August 06, 2012 9:13 AM
To: solr-user@lucene.apache.org
Subject: Returning page numbers where match occurs
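If you do go the re-parsing route, one approach discussed on SOLR-380 is essentially to index one Solr document per page, carrying a page-number field, so that hits resolve to pages naturally. A minimal SolrJ sketch under that assumption; the field names and URL are hypothetical, and the per-page splitting step (which is format-specific) is not shown:

import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PageLevelIndexer {
    // pages: the document's text, pre-split into one string per page
    // (how you extract per-page text from Word/PPT is up to you).
    static void indexPages(SolrServer server, String docId, List<String> pages)
            throws Exception {
        int pageNo = 1;
        for (String pageText : pages) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", docId + "_p" + pageNo); // unique per page
            doc.addField("source_doc_id", docId);      // groups pages by source file
            doc.addField("page_number", pageNo);
            doc.addField("text", pageText);
            server.add(doc);
            pageNo++;
        }
        server.commit();
    }

    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        indexPages(server, "report-42",
            java.util.Arrays.asList("page one text", "page two text"));
    }
}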
Re: Returning page numbers where match occurs
Thanks a lot, Jack, for your prompt reply! The JIRA issue indeed talks about what I want to accomplish. I will try out Tricia's solution. As regards your question about whether I want real page numbers: yes, ideally I want to get real page numbers (and am willing to put in the additional parsing effort to get them). For starters, even integer page numbers will work; do you have a simpler solution in mind for this case (than the one described in SOLR-380)? The text from the documents (for which I want page numbers) is represented in certain fields of the Solr/Lucene documents that I index, i.e., a many-to-one relation exists between the office documents and the Solr documents in my index.

Regards,
Debdoot
Reg Default search field
Hi, I have a question on the default search field defined in schema.xml (or, in later versions, specified as part of the search handlers). Do we always need this default search field defined in order to search when the field name is not passed in the query?

Suppose there is a field named 'Title' and it holds the value 'solr'. In order to get results for this search:

http://localhost:8080/solr/select/?q=solr&version=2.2&start=0&rows=10&indent=on

do I need to define a default search field and copy the contents of the specific field into it?

<defaultSearchField>text</defaultSearchField>
<copyField source="Title" dest="text"/>

Thanks in advance!
Lakshmi
Re: Reg Default search field
Lakshmi - The field(s) used for querying need to be specified somewhere, either as a default field or as a qf parameter to (e)dismax, etc.

Erik
Re: Reg Default search field
defaultSearchField is deprecated in Solr 3.6. It is still supported, but the df query request parameter overrides it. So, go into solrconfig.xml and change the df parameter value from text to Title.

-- Jack Krupansky
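Note that df can also be supplied per request rather than in solrconfig.xml, which is handy for testing before changing the handler defaults. A minimal SolrJ sketch (the URL is hypothetical, and HttpSolrServer is the SolrJ client as of 3.6):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class DefaultFieldQuery {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8080/solr");
        SolrQuery query = new SolrQuery("solr"); // no field name in the query itself
        query.set("df", "Title");                // df decides which field is searched
        System.out.println(server.query(query).getResults().getNumFound());
    }
}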
[ANNOUNCE] Lucene/Solr @ ApacheCon Europe - August 13th Deadline for CFP and Travel Assistance applications
ApacheCon Europe will be happening 5-8 November 2012 in Sinsheim, Germany, at the Rhein-Neckar-Arena. Early bird tickets go on sale this Monday, 6 August. http://www.apachecon.eu/

The Lucene/Solr track is shaping up to be quite impressive this year, so make your plans to attend and submit your session proposals ASAP!

-- CALL FOR PAPERS --

The Call for Participation for ApacheCon Europe has been extended to 13 August! To submit a presentation and for more details, visit http://www.apachecon.eu/cfp/

Post a banner on your website to show your support for ApacheCon Europe or North America (24-28 February 2013 in Portland, OR)! Download at http://www.apache.org/events/logos-banners/

We look forward to seeing you!
- the Apache Conference Committee, ApacheCon Planners

--- TRAVEL ASSISTANCE ---

We're pleased to announce that Travel Assistance (TAC) applications for ApacheCon Europe 2012 are now open! The Travel Assistance Committee exists to help those who would like to attend ApacheCon events but are unable to do so for financial reasons. For more info on this year's Travel Assistance application criteria, please visit the TAC website at http://www.apache.org/travel/

Some important dates: the application period officially opened on 23 July 2012. Applicants have until 13 August 2012 to submit their applications (which should contain as much supporting material as required to efficiently and accurately process your request); this will enable the Travel Assistance Committee to announce successful awards on or shortly after 24 August 2012.

As always, TAC expects to deal with a range of applications from many diverse backgrounds, so we encourage (as always) anyone thinking about sending in a TAC application to get it in ASAP.

We look forward to greeting everyone in Sinsheim, Germany in November.
Re: Stopping replication?
Erick,

Thank you for the courtesy of your reply. I was able to figure out the problem, and for the benefit of the list, I describe the analysis here. Judging by the caliber of those on this list, this is likely too basic for the interests of most, but newbies (among whom I still classify myself) might benefit.

Here's what occurred. Recall that the version I'm using is 3.3. I don't know if these comments extend to versions other than 3.3, but I suspect so. I noted in my initial plea: "I seem to recall that the slaves USED TO say Solr Replication Slave." It turns out that is indeed the case, and that was a clue that they weren't being recognized as slave servers.

The file solrconfig.xml contains the configuration setup for replication, under the entry <requestHandler name="/replication"> ... </requestHandler>. A slave knows it's a slave by the following entry:

<lst name="slave">
  <str name="enable">true</str>
  <str name="masterUrl">http://host:port/[solr home location, in my case 'apache-solr-3.3.0']/replication</str>
  <str name="pollInterval">00:00:60</str>
</lst>

The key here is the line <str name="enable">true</str>. There is at least one fancy way to define trueness or falseness: by defining the value as a parameter and passing the resolution of that parameter in to Solr when it starts. The reason for using this technique is to allow a single solrconfig.xml file to be deployed to all servers running Solr, and then configuring those servers as slaves or the master at the time the servers start. (The information on doing this is in the Solr wiki documentation for replication at http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node, incidentally.)

In my case, I'm running Solr under the WebLogic 10.3.2 application server. I had defined the line

<str name="enable">true</str>

as

<str name="enable">${org.apache.solr.handler.enable.slave:false}</str>

in my solrconfig.xml, and had been starting the WebLogic managed servers with the parameter -Dorg.apache.solr.handler.enable.master=false. Note that this parameter deals with the *master* and not the slave. This was working in my existing environment: despite the fact that no -Dorg.apache.solr.handler.enable.slave=true parameter was being passed in from WebLogic, the slaves were able to recognize themselves as slaves. In the new WebLogic environment, this was no longer the case. I don't know why at this point.

To solve the problem for the short term, I created a separate file for the slave servers that bypasses the whole parameter-resolution mechanism by defining that line under the slave configuration in its solrconfig.xml as:

<str name="enable">true</str>

That, of course, now leaves me with 2 solrconfig.xml files: one for the master server and one for the slave servers. My bottom line is that at least it's now working, people are not being impacted, and I can troubleshoot the underlying issue at a more leisurely pace.

Hope this helps someone, somewhere. Erick, thanks for taking an interest.

Tim Hibbs
Re: Running out of memory
You might want to look at turning down or eliminating your caches if you're running out of RAM. Possibly some of them have a low hit rate, which you can see on the Stats page. Caches with a low hit rate are only consuming RAM and CPU cycles. Also, using this JVM arg might reduce the memory footprint: -XX:+UseCompressedOops. In the end, though, the surefire solution would be to go to an instance type with more RAM: http://www.ec2instances.info/

Michael Della Bitta
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game

On Mon, Aug 6, 2012 at 1:48 PM, Jon Drukman jdruk...@gmail.com wrote:

Hi there. I am running Solr 1.4.1 on an Amazon EC2 box with 7.5 GB of RAM. It was set up about 18 months ago and has been largely trouble-free. Unfortunately, lately it has started to run out of memory pretty much every day. We are seeing:

SEVERE: java.lang.OutOfMemoryError: Java heap space

When that happens, a simple query like http://localhost:8983/solr/select?q=*:* returns nothing. I am starting Solr with the following:

/usr/lib/jvm/jre/bin/java -XX:+UseConcMarkSweepGC -Xms1G -Xmx5G -jar start.jar

It would be vastly preferable if Solr could just exit when it gets a memory error, because we have it running under daemontools, and that would cause an automatic restart. After restarting, Solr works fine for another 12-18 hours. Not ideal, but at least it wouldn't require human intervention to get it going again.

What can I do to reduce the memory pressure? Does Solr require the entire index to fit in memory at all times? The on-disk size is 15 GB. There are 27.5 million documents, but they are all tiny (mostly one-line forum comments like "this game is awesome"). We're using Sun openJava SDK 1.6 if that matters.

-jsd-
Multiple Embedded Servers Pointing to single solrhome/index
Hi, I'm trying to use two embedded Solr servers pointing to the same solrhome/index. That's something like:

System.setProperty("solr.solr.home", "SomeSolrDir");
CoreContainer.Initializer initializer = new CoreContainer.Initializer();
CoreContainer coreContainer = initializer.initialize();
m_server = new EmbeddedSolrServer(coreContainer, "");

in both applications. The problem is, after I have done one add+commit of a SolrInputDocument on one embedded server, the other server can no longer obtain the write lock. I'm thinking there must be a way of releasing the write lock so other servers may pick it up. Is there an API that does so? Any inputs are appreciated. Bing
Two questions on spellchecking
Hi, even though I read a lot, none of my spellchecker configurations works really well. I have reached a dead end. Maybe someone could help solve my challenges:

- How can I get case-sensitive suggestions, independent of the case given in the query?
- How do I configure 'did you mean' spellchecking, as discussed in https://issues.apache.org/jira/browse/SOLR-2585 (Context-Sensitive Spelling Suggestions & Collations)?

I'm using the following environment:
- Solr 4.0-alpha (downloaded 25 June)
- Java 7

- schema.xml:

<fieldType name="textSuggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
...
<field name="suggest" type="textSuggest" indexed="true" stored="true" required="false" multiValued="true"/>

- solrconfig.xml (suggester):

<requestHandler name="/hint" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">all</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggester</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.count">20</str>
  </lst>
  <arr name="components">
    <str>suggester</str>
  </arr>
</requestHandler>

<searchComponent name="suggester" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggester</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">suggest</str>
  </lst>
</searchComponent>

- solrconfig.xml (spellcheck):

<requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">all</str>
    <int name="rows">10</int>
    <str name="df">allfields</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.count">20</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">suggest</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.1</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">1</int>
    <float name="maxQueryFrequency">0.1</float>
    <float name="thresholdTokenFrequency">0.001</float>
  </lst>
</searchComponent>

*Suggester problem*

With this configuration the suggester works case-insensitively, but the hints are all lower case. Example:

.../hint?q=da&wt=xml&spellcheck=true&spellcheck.build=true

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">173</int>
    <lst name="params">
      <str name="spellcheck">true</str>
      <str name="echoParams">all</str>
      <str name="spellcheck.extendedResults">true</str>
      <str name="spellcheck.dictionary">suggester</str>
      <str name="spellcheck.count">20</str>
      <str name="spellcheck.onlyMorePopular">false</str>
      <str name="spellcheck">true</str>
      <str name="q">da</str>
      <str name="wt">xml</str>
      <str name="spellcheck.build">true</str>
    </lst>
  </lst>
  <str name="command">build</str>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="da">
        <int name="numFound">20</int>
        <int name="startOffset">0</int>
        <int name="endOffset">2</int>
        <arr name="suggestions">
          <str>dat-marktspiegel spezial</str>
          <str>data structures with c++ using stl</str>
          <str>data warehouse</str>
          <str>datan, ingeborg</str>
          <str>datenbanken mit delphi</str>
          <str>datenverschlüsselung</str>
          <str>dauner, gabriele</str>
          <str>dautermann, margit</str>
          <str>david copperfield</str>
          <str>david, horst</str>
          <str>david, leo</str>
          <str>david, nicholas</str>
          <str>davis, charles t.</str>
          <str>davis, edward l</str>
          <str>davis, leslie dorfman</str>
          <str>davis, stanley m.</str>
          <str>davor kommt noch</str>
          <str>davydova, irina n.</str>
          <str>dawidowski, bernd</str>
          <str>dayan, daniel</str>
        </arr>
      </lst>
      <bool name="correctlySpelled">false</bool>
    </lst>
  </lst>
</response>

Using just solr.StrField as the field type, the suggestions preserve the original capitalization, but I get no suggestions if the query starts with a lower-case character.

*Spelling problem*

One of the indexed entries in the field 'suggest' is "David Copperfield", and I want this string as an alternative suggestion to the query "David opperfield". Example:

.../select?q=david+opperfield&rows=0&wt=xml&spellcheck=true

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">15</int>
    <lst name="params">
      <str name="df">allfields</str>
      <str name="echoParams">all</str>
      <str name="spellcheck.extendedResults">true</str>
      <str name="spellcheck.count">20</str>
      <str name="spellcheck.onlyMorePopular">false</str>
      <str name="rows">0</str>
      <str name="spellcheck">true</str>
      <str name="q">david opperfield</str>
      <str
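As an aside, the /hint handler above can also be exercised from SolrJ, which makes it easier to iterate on the spellcheck parameters per request. A minimal sketch assuming the configuration shown above; the host URL is hypothetical, and HttpSolrServer is the SolrJ client available in 4.0-alpha:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse;

public class SuggestClient {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("da");
        query.set("qt", "/hint");              // route to the suggester handler
        query.set("spellcheck", "true");
        query.set("spellcheck.build", "true"); // build the dictionary on first use

        QueryResponse rsp = server.query(query);
        SpellCheckResponse sc = rsp.getSpellCheckResponse();
        if (sc != null) {
            for (SpellCheckResponse.Suggestion s : sc.getSuggestions()) {
                System.out.println(s.getToken() + " -> " + s.getAlternatives());
            }
        }
    }
}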
Re: Trending topics?
Chris, I'm not sure if Solr by itself can really do this (easily and/or well). Have a look at http://sematext.com/products/key-phrase-extractor/index.html which can do exactly that, but without Solr. Some of the highlighted bits refer to trending topics, though not using exactly that terminology.

Otis
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm

From: Chris Dawson xrdaw...@gmail.com
To: solr-user@lucene.apache.org
Sent: Thursday, August 2, 2012 11:34 AM
Subject: Trending topics?

How would I generate a list of trending topics using solr?

Chris
Re: Multiple Embedded Servers Pointing to single solrhome/index
Where is the common index? On NFS? If it is on a native hard disk (on the same computer), Solr uses the file-locking mechanism supplied by the operating system (Linux or Windows). This may not be working right. See this for more info on file locking: http://wiki.apache.org/lucene-java/AvailableLockFactories

-- Lance Norskog goks...@gmail.com
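As background, Lucene permits only one IndexWriter on an index at a time, so the second embedded server cannot acquire the write lock while the first still holds it. I am not aware of an API for stealing the lock from a live writer; the clean way to release it is to shut down the CoreContainer of the process that was writing. A minimal sketch, reusing the setup from the original post:

import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.core.CoreContainer;

public class SingleWriterExample {
    public static void main(String[] args) throws Exception {
        System.setProperty("solr.solr.home", "SomeSolrDir");
        CoreContainer.Initializer initializer = new CoreContainer.Initializer();
        CoreContainer coreContainer = initializer.initialize();
        EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, "");

        // ... add and commit documents with this single writer ...

        // Shutting down the container closes the cores and their
        // IndexWriters, which releases the on-disk write lock so
        // another process can open the index for writing.
        coreContainer.shutdown();
    }
}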
Re: Problem with Solr 4.0-ALPHA and JSON response
On Fri, Jul 27, 2012 at 6:32 PM, Federico Valeri fedeval...@gmail.com wrote:
> Hi all, I'm new to Solr. I have a problem with the JSON format; this is my Java client code:

Hi, the Java client (SolrServer) can only operate with the xml or javabin format. If you need to get the JSON response from Solr using Java, you could just use an HTTP client directly and bypass the Solr client.

> Now the problem is that I receive the response but it doesn't trigger the JavaScript callback function. I see wt=javabin in the SolrCore.execute log, even if I set wt=json in the parameters; is this normal?

Yes. To control the format used by the client, there's a method HttpSolrServer#setParser that sets the client parser (and that also overrides the wt param when the request is made).

-- Sami Siren
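For completeness, here is a minimal sketch of the suggestion above: bypassing SolrJ and fetching the JSON response directly over HTTP from Java. The host and query are hypothetical; json.wrf is Solr's standard parameter for wrapping the JSON response in a named JavaScript callback:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class RawJsonQuery {
    public static void main(String[] args) throws Exception {
        // wt=json asks Solr for a JSON response body; json.wrf wraps it
        // in a callback function for JSONP-style consumption.
        URL url = new URL("http://localhost:8983/solr/select"
            + "?q=*:*&wt=json&json.wrf=myCallback");
        BufferedReader in = new BufferedReader(
            new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // prints: myCallback({"responseHeader":...})
        }
        in.close();
    }
}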