Re: Best practices for Solr (how to update jar files safely)
I side with Toke on this. Enterprise bare-metal machines often have hundreds of gigabytes of memory and tens of CPU cores -- you would have to fit multiple instances on such a machine to make use of the hardware while avoiding huge heaps. If this is not a common case now, it could well become one as hardware evolves -- so I would rather document the factors which call for multiple instances than discourage them. On 20 Feb 2016 14:55, "Toke Eskildsen" wrote: > Shawn Heisey wrote: > > I've updated the "Taking Solr to Production" reference guide page with > > what I feel is an appropriate caution against running multiple instances > > in a typical installation. I'd actually like to use stronger language, > > And I would like you to use softer language. > > Machines get bigger all the time and, as you state yourself, GC can > (easily) become a problem as the heap grows. With reference to the 32GB JVM > limit for compressed pointers, a max Xmx just below 32GB looks like a practical > choice for a Solr installation (if possible, of course): running 2 instances > of 31GB will provide more usable memory than a single instance of 64GB. > > https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/ > > Caveat: I have not done any testing on this with Solr, so I do not know > how large the effect is. Some things, such as String faceting, DocValues > structures and some of the field caches are array-of-atomics oriented and > will not suffer with larger pointers. Other things, such as numerics > faceting, large rows settings and grouping use a lot of objects and will > require more memory. The overhead will differ depending on usage. > > We tend to use separate Solr installations on the same machines. For some > machines we do it to allow for independent upgrades (long story), for > others because a heap of 200GB is not something we are ready to experiment > with. > > - Toke Eskildsen >
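A quick way to see whether a given -Xmx still gets you compressed pointers is to ask the JVM directly. These commands are illustrative (run them against the same JDK that runs Solr; the exact flag output varies between JVM versions and vendors):

```
java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops   # expect true
java -Xmx64g -XX:+PrintFlagsFinal -version | grep UseCompressedOops   # expect false
```

If the flag reports false, the heap is past the compressed-pointer threshold and every object reference costs 8 bytes instead of 4.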
Re: It's possible up and debug solr in eclipse IDE?
I should add to Erick's point that the test framework allows you to test HTTP APIs through an embedded Jetty instance, so you should be able to do anything that you do with a remote Solr instance from code. On 12 Jan 2016 18:24, "Erick Erickson" wrote: > And a neater way to debug stuff than attaching to > Solr is to step through the JUnit tests that exercise the code > you need to work on. > This is often much faster than compile/start Solr/attach. > > Of course some problems don't fit that process, but I thought > I'd mention it. > > Best, > Erick > > On Tue, Jan 12, 2016 at 4:08 AM, Vincenzo D'Amore > wrote: > > Mmmm... I'm not sure it's worth the trouble. Anyway, I'm just curious, when > > you find a way let me know. > > > > On Tue, Jan 12, 2016 at 1:01 PM, Rodrigo Testillano < > > rodrite.testill...@gmail.com> wrote: > > > >> Yes, remote debug is working, but I want to bring up a Jetty with Solr in > >> Eclipse like I did with Tomcat in older versions. Thank you very much > for > >> your help! I am going to try another way to do it, but maybe it will not be > >> possible > >> > >> 2016-01-12 12:51 GMT+01:00 Rodrigo Testillano < > >> rodrite.testill...@gmail.com> > >> : > >> > >> > Thank you so much!, I'm going to try right now and tell you my > results!! > >> > > >> > 2016-01-12 12:47 GMT+01:00 Vincenzo D'Amore : > >> > > >> >> Yep. > >> >> > >> >> I did this just a few hours ago. > >> >> Let's download the Solr source: > >> >> > >> >> wget > >> http://it.apache.contactlab.it/lucene/solr/5.4.0/solr-5.4.0-src.tgz > >> >> > >> >> untar the file. > >> >> > >> >> I'm not sure they are all needed, but I have already installed the latest versions > of: > >> >> ant, > >> >> ivy and maven. > >> >> > >> >> Then in the solr-5.4.0 directory I did this: > >> >> > >> >> ant resolve > >> >> > >> >> ant eclipse > >> >> > >> >> Now you can import solr-5.4.0 as an Eclipse project. 
> >> >> > >> >> Under the hood the ant "eclipse" task has created the .project and > >> .classpath > >> >> files and the .settings directory. > >> >> > >> >> Now if you want to debug, all you need to do is create a Java > >> >> remote debug configuration in Eclipse and start Solr with the debugging > parameters: > >> >> > >> >> ./solr start -m 4g -a "-Xdebug > >> >> -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1044" > >> >> > >> >> :) > >> >> > >> >> On Tue, Jan 12, 2016 at 12:31 PM, Rodrigo Testillano < > >> >> rodrite.testill...@gmail.com> wrote: > >> >> > >> >> > I need to debug my custom processor (updateRequestProcessor) in my > >> Eclipse > >> >> > IDE. With old Solr versions it was possible, but with Solr running as a > >> >> service > >> >> > with Jetty I don't know if there is some way to do it > >> >> > -- > >> >> > Un Saludo. > >> >> > > >> >> > Rodrigo Testillano Tordesillas. > >> >> > > >> >> > >> >> > >> >> > >> >> -- > >> >> Vincenzo D'Amore > >> >> email: v.dam...@gmail.com > >> >> skype: free.dev > >> >> mobile: +39 349 8513251 > >> >> > >> > > >> > > >> > > >> > -- > >> > Un Saludo. > >> > > >> > Rodrigo Testillano Tordesillas. > >> > > >> > >> > >> -- > >> Un Saludo. > >> > >> Rodrigo Testillano Tordesillas. > >> > > > > > > > > -- > > Vincenzo D'Amore > > email: v.dam...@gmail.com > > skype: free.dev > > mobile: +39 349 8513251 >
Re: Number of requests to each shard is different with and without using of grouping
M is the number of ids you want for each group, specified by group.limit. It's unrelated to the number of rows requested. On 21 Aug 2015 19:54, SolrUser1543 osta...@gmail.com wrote: Ramkumar R. Aiyengar wrote: Grouping does need 3 phases. The phases are: (2) For the N groups, each shard is asked for the top M ids (M is configurable per request). What exactly do you mean by "M is configurable per request"? How exactly is it configurable, and what is the relation between N (which is the initial rows number) and M? -- View this message in context: http://lucene.472066.n3.nabble.com/Number-of-requests-to-each-shard-is-different-with-and-without-using-of-grouping-tp4224293p4224521.html Sent from the Solr - User mailing list archive at Nabble.com.
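For reference, both knobs are plain request parameters; a sketch of a grouped query (the field name and query values are made up):

```
q=*:*&group=true&group.field=category&rows=10&group.limit=5
```

Here rows=10 asks for the top 10 groups (the N above), and group.limit=5 asks for up to 5 documents per group (the M above).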
Re: SOLR to SOLR communication with custom authentication
Custom authentication support was added in 5.x, and the imminent (in the next few days) 5.3 release has a lot of features in this regard, including a basic authentication module; I would suggest upgrading to it. 5.x versions (including 5.3) do support Java 7, so I don't see an issue there. On 20 Aug 2015 12:48, Prasad Bodapati prasad.bodap...@pb.com wrote: Hi All, We have a cluster environment on JBOSS. All of our deployed applications are protected by OpenAM, including SOLR. On slave nodes we enabled SOLR to communicate with master nodes to get data. Since the SOLR on the master is protected with OpenAM, the slave can't talk to it. In solr.xml there is a way to configure replication requests to use basic HTTP authentication, but not to use custom authentication. I have tried to override the ReplicationHandler and SnapPuller classes to provide custom authentication but I couldn't. I have tried to follow the instructions at https://wiki.apache.org/solr/SolrSecurity but I could not find the classes org.apache.solr.security.InterSolrNodeAuthCredentialsFactory.SubRequestFactory and org.apache.solr.security.InterSolrNodeAuthCredentialsFactory.SubRequestFactory. Has anyone of you used custom authentication before for replication? Any help would be greatly appreciated. Environment SOLR version: 4.10.2 (We can't upgrade at the moment as we use Java 7) JBOSS 6.2 EAP Thanks, Prasad
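For reference, the 5.3 Basic Authentication module is switched on by placing a security.json in ZooKeeper. A minimal sketch (the credential below is the well-known example hash for the password "SolrRocks" from the Solr documentation; generate your own in practice):

```json
{
  "authentication": {
    "class": "solr.BasicAuthPlugin",
    "credentials": {
      "solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="
    }
  }
}
```

Solr 5.3 also takes care of authenticating inter-node requests for you (via a PKI mechanism), which is the part the 4.x setup above was missing.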
Re: Number of requests to each shard is different with and without using of grouping
Grouping does need 3 phases. The phases are: (1) Each shard is asked for the top N groups (instead of ids), with the sort value. The federator then sorts the groups from all shards and chooses the top N groups. (2) For the N groups, each shard is asked for the top M ids (M is configurable per request). The top M ids from each shard for every group are again sorted within each group to find the overall top M. At the end of this phase, you have the top N groups with the top M ids for each group. (3) The final phase gets the stored fields for these M*N ids. On 20 Aug 2015 20:00, SolrUser1543 osta...@gmail.com wrote: I want to understand why the number of requests in SOLR CLOUD is different with and without use of the grouping feature. 1. Suppose we have several shards in SOLR CLOUD (let's say 3 shards). 2. One of them gets a query with rows = n. 3. This shard distributes the request among the others, and suppose that every shard has a lot of results, much more than n. 4. Then it receives item IDs from each shard, so the number of results in total is 3n. 5. Then it sorts the results and chooses the best n results, where in my case each shard has representatives in the total results. 6. Then it sends a second request to each shard, with the appropriate item IDs, to get the stored fields. So in this case, each shard will be queried twice: first to get item IDs, and second to get stored fields. That is what I see in my logs. (I see 6 log entries, 2 for each shard.) *The question is, why, when I am using the grouping feature, is the number of requests to each shard 3 instead of 2?* (I see 8 or 9 log entries.) -- View this message in context: http://lucene.472066.n3.nabble.com/Number-of-requests-to-each-shard-is-different-with-and-without-using-of-grouping-tp4224293.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 5.2.1 on Solaris
Please open a JIRA with details of what the issues are; we should try to support this. On 18 Jun 2015 15:07, Bence Vass bence.v...@inso.tuwien.ac.at wrote: Hello, Is there any documentation on how to start Solr 5.2.1 on Solaris (Solaris 10)? The script (solr start) doesn't work out of the box. Is anyone running Solr 5.x on Solaris? - Thanks
Re: Please help test the new Angular JS Admin UI
I started with an empty Solr instance and Firefox 38 on Linux. This is the trunk source.. There's a 'No cores available. Go and create one' button available in the old and the new UI. In the old UI, clicking it goes to the core admin, and pops open the dialog for Add Core. The new UI only goes to the core admin. Also, when you then click on the Add Core, the dialog bleeds into the sidebar. I then started with a getting started config and a cloud of 2x2. Then brought up admin UI on one of them, opened up one of the cores, and clicked on the Files tab -- that showed an exception.. {data:{responseHeader:{status:500,QTime:1},error:{msg:Path must not end with / character,trace:java.lang.IllegalArgumentException: Path must not end with / character\n\tat org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:58)\n\tat org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1024)\n\tat org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkClient.java:319)\n\tat org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkClient.java:316)\n\tat org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)\n\tat org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:316)\n\tat org.apache.solr.handler.admin.ShowFileRequestHandler.getAdminFileFromZooKeeper(ShowFileRequestHandler.java:324)\n\tat org.apache.solr.handler.admin.ShowFileRequestHandler.showFromZooKeeper(ShowFileRequestHandler.java:148)\n\tat org.apache.solr.handler.admin.ShowFileRequestHandler.handleRequestBody(ShowFileRequestHandler.java:135)\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)\n\tat org.apache.solr.core.SolrCore.execute(SolrCore.java:2057)\n\tat org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:648)\n\tat org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:452)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)\n\tat 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)\n\tat Moving to Plugins/Stats, and then Core, and selecting the first searcher entry (e.g. for me, it is Searcher@3a7bd1[gettingstarted_shard1_replica1] main), I see stats like: - searcherName:Searcher@#8203;3a7bd1[gettingstarted_shard1_replica1] main - reader: ExitableDirectoryReader(#8203;UninvertingDirectoryReader(#8203;)) Notice the unescaped characters there..
Re: SolrCloud Leader Election
This shouldn't happen, but if it does, there's no good way currently for Solr to automatically fix it. There are a couple of issues being worked on to do that currently. But till then, your best bet is to restart the node which you expect to be the leader (you can look at ZK to see who is at the head of the queue it maintains). If you can't figure that out, safest is to just stop/start all nodes in sequence, and if that doesn't work, stop all nodes and start them back one after the other. On 21 May 2015 00:24, Ryan Steele ryan.ste...@pgi.com wrote: My SolrCloud cluster isn't reassigning the collections leaders from downed cores--the downed cores are still listed as the leaders. The cluster has been in the state for a few hours and the logs continue to report No registered leader was found after waiting for 4000ms. Is there a way to force it to reassign the leader? I'm running SolrCloud 5.0. I have 7 Solr nodes, 3 Zookeeper nodes, and 3739 collections. Thanks, Ryan --- This email has been scanned for email related threats and delivered safely by Mimecast. For more information please visit http://www.mimecast.com ---
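To find out who is at the head of the election queue, you can list the election znodes with ZooKeeper's own zkCli.sh (the path below follows the SolrCloud layout in this version; the collection and shard names are placeholders). The entry with the lowest sequence suffix is the node Solr expects to become leader:

```
./zkCli.sh -server localhost:2181 ls /collections/mycollection/leader_elect/shard1/election
```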
Re: Multiple index.timestamp directories using up disk space
Yes, data loss is the concern. If the recovering replica is not able to retrieve the files from the leader, it at least has an older copy. Also, the entire index is not fetched from the leader, only the segments which have changed. The replica initially gets the file list from the leader, checks it against what it has, and then downloads the difference -- then moves it to the main index. Note that this process can fail sometimes (say due to I/O errors, or due to a problem with the leader itself), in which case the replica drops all accumulated files from the leader, and starts from scratch. If that happens, it needs to look back at its old index again to figure out what it needs to download on the next attempt. Maybe with a fair number of assumptions which usually hold good, you could come up with a mechanism to drop existing files first, but those assumptions won't hold in case of serious issues with the cloud, and you could end up losing data. That's worse than using a bit more disk space! On 4 May 2015 11:56, Rishi Easwaran rishi.easwa...@aol.com wrote: Thanks for the responses Mark and Ramkumar. The question I had was, why does Solr need 2 copies at any given time, leading to 2x disk space usage. This information doesn't seem to be published anywhere, which makes HW estimation almost impossible for a large-scale deployment. Even if the copies are temporary, this becomes really expensive, especially when using SSD in production, when the complex size is over 400TB indexes, running 1000's of solr cloud shards. If a solr follower has decided that it needs to do replication from the leader and capture a full copy snapshot, why can't it delete the old information and replicate from scratch, not requiring more disk space? Is the concern data loss (a case where both leader and follower lose data)? Thanks, Rishi. 
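The file-list diff at the heart of this is simple set arithmetic. A toy sketch with plain shell tools (the file names are invented; Solr's real check also compares sizes and checksums, not just names):

```shell
# Stable sort order for comm below
export LC_ALL=C

# Pretend file lists: what the leader reports vs. what the replica already has
printf '%s\n' _1.cfs _2.cfs _3.cfs segments_3 | sort > /tmp/leader_files
printf '%s\n' _1.cfs _2.cfs segments_2 | sort > /tmp/replica_files

# Lines only in the leader's list = what the replica must download
comm -23 /tmp/leader_files /tmp/replica_files
```

Only the difference is fetched; files already present locally are reused, which is why the old index directory is kept around until the new one is complete.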
-Original Message- From: Mark Miller markrmil...@gmail.com To: solr-user solr-user@lucene.apache.org Sent: Tue, Apr 28, 2015 10:52 am Subject: Re: Multiple index.timestamp directories using up disk space If copies of the index are not eventually cleaned up, I'd file a JIRA to address the issue. Those directories should be removed over time. At times there will have to be a couple around at the same time, and others may take a while to clean up. - Mark On Tue, Apr 28, 2015 at 3:27 AM Ramkumar R. Aiyengar andyetitmo...@gmail.com wrote: SolrCloud does need up to twice the amount of disk space as your usual index size during replication. Amongst other things, this ensures you have a full copy of the index at any point. There's no way around this; I would suggest you provision the additional disk space needed. On 20 Apr 2015 23:21, Rishi Easwaran rishi.easwa...@aol.com wrote: Hi All, We are seeing this problem with solr 4.6 and solr 4.10.3. For some reason, solr cloud tries to recover and creates a new index directory - (ex:index.20150420181214550), while keeping the older index as is. This creates an issue where the disk space fills up and the shard never ends up recovering. Usually this requires a manual intervention of bouncing the instance and wiping the disk clean to allow for a clean recovery. Any ideas on how to prevent solr from creating multiple copies of the index directory? Thanks, Rishi.
Re: Multiple index.timestamp directories using up disk space
SolrCloud does need up to twice the amount of disk space as your usual index size during replication. Amongst other things, this ensures you have a full copy of the index at any point. There's no way around this; I would suggest you provision the additional disk space needed. On 20 Apr 2015 23:21, Rishi Easwaran rishi.easwa...@aol.com wrote: Hi All, We are seeing this problem with solr 4.6 and solr 4.10.3. For some reason, solr cloud tries to recover and creates a new index directory - (ex:index.20150420181214550), while keeping the older index as is. This creates an issue where the disk space fills up and the shard never ends up recovering. Usually this requires a manual intervention of bouncing the instance and wiping the disk clean to allow for a clean recovery. Any ideas on how to prevent solr from creating multiple copies of the index directory? Thanks, Rishi.
Re: Restart solr failed after applied the patch in https://issues.apache.org/jira/browse/SOLR-6359
It shouldn't be any different without the patch, or with the patch and (100,10) as parameters, which is why I wanted you to check with 100,10. If you see the same issue with that, then the patch is probably not the problem; maybe it is with the patched build in general. On 30 Mar 2015 13:01, forest_soup tanglin0...@gmail.com wrote: But if the values can only be 100,10, is there any difference from not having the patch? Can we enlarge those 2 values? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Restart-solr-failed-after-applied-the-patch-in-https-issues-apache-org-jira-browse-SOLR-6359-tp4196251p4196280.html Sent from the Solr - User mailing list archive at Nabble.com.
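For reference, the suggested (100, 10) configuration from SOLR-6359 would look like this in solrconfig.xml (a sketch based on the thread; the dir value is whatever your setup already uses):

```xml
<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numRecordsToKeep">100</int>
  <int name="maxNumLogsToKeep">10</int>
</updateLog>
```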
Re: Restart solr failed after applied the patch in https://issues.apache.org/jira/browse/SOLR-6359
I doubt this has anything to do with the patch. Do you observe the same behaviour if you reduce the values in the config to the defaults (100, 10)? On 30 Mar 2015 09:51, forest_soup tanglin0...@gmail.com wrote: https://issues.apache.org/jira/browse/SOLR-6359 I also posted the questions to the JIRA ticket. We have a SolrCloud with 5 solr servers of Solr 4.7.0. There is one collection with 80 shards (2 replicas per shard) on those 5 servers. And we made a patch by merging the patch (https://issues.apache.org/jira/secure/attachment/12702473/SOLR-6359.patch ) into the 4.7.0 stream. After applying the patch to our servers, with the config change uploaded to ZooKeeper, we did a restart on one of the 5 solr servers and hit some issues on that server. Below are the details. The solrconfig.xml we changed:

<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numRecordsToKeep">1</int>
  <int name="maxNumLogsToKeep">100</int>
</updateLog>

After we restarted that one solr server, with the other 4 servers not running, we saw the exceptions below on the restarted one: ERROR - 2015-03-16 20:48:48.214; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Exception writing document id Q049bGx0bWFpbDIxL089bGxwX3VzMQ==41703656!B68BF5EC5A4A650D85257E0A00724A3B to the index; possible analysis error. 
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69) at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51) at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:703) at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:857) at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:556) at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:96) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:166) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:136) at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:225) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:121) at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:190) at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:116) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:173) at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:106) at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58) at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:780) at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408) at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1040) at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:607) at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:314) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1156) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:626) at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) at java.lang.Thread.run(Thread.java:804) Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed at
Re: How to use ConcurrentUpdateSolrServer for Secured Solr?
Not a direct answer, but Anshum just created this: https://issues.apache.org/jira/browse/SOLR-7275 On 20 Mar 2015 23:21, Furkan KAMACI furkankam...@gmail.com wrote: Is there any way to use ConcurrentUpdateSolrServer with a secured Solr, like with CloudSolrServer: HttpClientUtil.setBasicAuth(cloudSolrServer.getLbServer().getHttpClient(), username, password); I see that there is no way to access the HttpClient for ConcurrentUpdateSolrServer? Kind Regards, Furkan KAMACI
Re: Want to modify Solr Source Code
Is your concern that you want to be able to modify the source code just on your machine, or that you can't for some reason install svn? If it's the former: even if you check out using svn, you can't modify anything outside your machine, as changes can be checked in only by the committers of the project. You will need to raise a JIRA for the changes to go back in, as described by the wiki page. If the latter, try the downloads section at https://lucene.apache.org/solr and choose the download which ends in -src.tgz; that has the source bundled as a single file. On 17 Mar 2015 07:42, Nitin Solanki nitinml...@gmail.com wrote: Hi Gora, I want to make changes only on my machine, without svn. I want to test the source code. How? Any steps to do so? Please help. On Tue, Mar 17, 2015 at 1:01 PM, Gora Mohanty g...@mimirtech.com wrote: On 17 March 2015 at 12:22, Nitin Solanki nitinml...@gmail.com wrote: Hi, I want to modify the Solr source code. I don't have any idea where the source code is available. How can I do it? Any help please... Please start with: http://wiki.apache.org/solr/HowToContribute#Contributing_Code_.28Features.2C_Bug_Fixes.2C_Tests.2C_etc29 Regards, Gora
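For completeness, the checkout route described on that wiki page looked roughly like this at the time (repository URL per the 2015 svn layout; verify against the wiki before relying on it):

```
svn checkout https://svn.apache.org/repos/asf/lucene/dev/trunk lucene-trunk
cd lucene-trunk/solr
ant compile
```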
Re: Whole RAM consumed while Indexing.
Yes, and doing so is painful and takes lots of people and hardware resources to get there for large amounts of data and queries :) As Erick says, work backwards from 60s and first establish how high the commit interval can be to satisfy your use case. On 16 Mar 2015 16:04, Erick Erickson erickerick...@gmail.com wrote: First start by lengthening your soft and hard commit intervals substantially. Start with 60 seconds and work backwards, I'd say. Ramkumar has tuned the heck out of his installation to get the commit intervals to be that short ;). I'm betting that you'll see your RAM usage go way down, but that's a guess until you test. Best, Erick On Sun, Mar 15, 2015 at 10:56 PM, Nitin Solanki nitinml...@gmail.com wrote: Hi Erick, You are right. **Overlapping searchers warning messages** are coming in the logs. The **numDocs numbers** are changing as documents are added during indexing. Any help? On Sat, Mar 14, 2015 at 11:24 PM, Erick Erickson erickerick...@gmail.com wrote: First, the soft commit interval is very short. Very, very, very, very short. 300ms is just short of insane unless it's a typo ;). Here's a long background: https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ But the short form is that you're opening searchers every 300 ms. The hard commit is better, but every 3 seconds is still far too short IMO. I'd start with soft commits of 60000 and hard commits of 60000 (60 seconds), meaning that you're going to have to wait 1 minute for docs to show up unless you explicitly commit. You're throwing away all the caches configured in solrconfig.xml more than 3 times a second, executing autowarming, etc, etc, etc. Changing these to longer intervals might cure the problem, but if not then, as Hoss would say, details matter. 
I suspect you're also seeing overlapping searchers warning messages in your log, and it's _possible_ that what's happening is that you're just exceeding the max warming searchers and never opening a new searcher with the newly-indexed documents. But that's a total shot in the dark. How are you looking for docs (and not finding them)? Does the numDocs number in the solr admin screen change? Best, Erick On Thu, Mar 12, 2015 at 10:27 PM, Nitin Solanki nitinml...@gmail.com wrote: Hi Alexandre, *Hard Commit* is: <autoCommit> <maxTime>${solr.autoCommit.maxTime:3000}</maxTime> <openSearcher>false</openSearcher> </autoCommit> *Soft Commit* is: <autoSoftCommit> <maxTime>${solr.autoSoftCommit.maxTime:300}</maxTime> </autoSoftCommit> And I am committing 2 documents each time. Is this a good config for committing? Or am I doing something wrong? On Fri, Mar 13, 2015 at 8:52 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: What's your commit strategy? Explicit commits? Soft commits/hard commits (in solrconfig.xml)? Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 12 March 2015 at 23:19, Nitin Solanki nitinml...@gmail.com wrote: Hello, I have written a python script to do 2 documents indexing each time on Solr. I have 28 GB RAM with 8 CPU. When I started indexing, at that time 15 GB RAM was free. While indexing, all RAM is consumed but **not** a single document is indexed. Why so? And it throws *HTTPError: HTTP Error 503: Service Unavailable* in the python script. I think it is due to heavy load on Zookeeper by which all nodes went down. I am not sure about that. Any help please.. Or anything else is happening.. And how to overcome this issue. Please assist me towards the right path. Thanks.. Warm Regards, Nitin Solanki
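Erick's suggested 60-second intervals would look like this in solrconfig.xml (a sketch; openSearcher=false keeps hard commits from opening a new searcher, so they stay cheap):

```xml
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>60000</maxTime>
</autoSoftCommit>
```

With this, newly indexed documents become visible within about a minute unless an explicit commit is issued.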
Re: Jetty version
Yes, Solr 5.0 uses Jetty 8. FYI, the upcoming 5.1 release will move to Jetty 9. Also, just in case it matters -- as noted in the 5.0 release notes, the use of Jetty is now an implementation detail and we might move away from it in the future -- so you shouldn't depend on Solr using Jetty, or on a particular version of Jetty. On 12 Mar 2015 10:33, Aman Tandon amantandon...@gmail.com wrote: Hi, I am not sure, but when I look into the server/lib directory I can see version 8.1 on all the lib files present in that folder, so I am guessing it's version 8.1. I cross-checked by downloading the new Jetty server, which was jetty-9.2, and comparing the version on its jetty libraries. With Regards Aman Tandon On Thu, Mar 12, 2015 at 12:19 PM, Philippe de Rochambeau phi...@free.fr wrote: Hello, which jetty version does solr 5 integrate? Cheers, Philippe
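A quick way to check what your own install bundles is to look at the jar names (path relative to the Solr 5.x install directory; the output shown is illustrative):

```
ls server/lib/ | grep jetty
# e.g. jetty-server-8.1.10.v20130312.jar, jetty-servlet-8.1.10.v20130312.jar, ...
```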
Re: 4.10.4 - nodes up, shard without leader
The update log replay issue looks like https://issues.apache.org/jira/browse/SOLR-6583 On 9 Mar 2015 01:41, Mark Miller markrmil...@gmail.com wrote: Interesting bug. First there is the already-closed transaction log. That by itself deserves a look. I'm not even positive we should be replaying the log when reconnecting after a ZK disconnect, but even if we do, this should never happen. Beyond that there seems to be some race. Because of the log trouble, we try and cancel the election - but we don't find the ephemeral election node yet for some reason and so just assume it's fine, no node there to remove (well, we WARN, because it is a little unexpected). Then that ephemeral node materializes I guess, and the new leader doesn't register because the old leader won't give up the throne. We don't try and force the new leader because that may just hide bugs and cause data loss, so no leader is elected. I'd guess there are two JIRA issues to resolve here. - Mark On Sun, Mar 8, 2015 at 8:37 AM Markus Jelsma markus.jel...@openindex.io wrote: Hello - I stumbled upon an issue I've never seen before: a shard with all nodes up and running but no leader. This is on 4.10.4. One of the two nodes emits the following error log entry: 2015-03-08 05:25:49,095 WARN [solr.cloud.ElectionContext] - [Thread-136] - : cancelElection did not find election node to remove /overseer_elect/election/93434598784958483-178.21.116.225:8080_solr-n_000246 2015-03-08 05:25:49,121 WARN [solr.cloud.ElectionContext] - [Thread-136] - : cancelElection did not find election node to remove /collections/oi/leader_elect/shard3/election/93434598784958483-178.21.116. 
225:8080_solr_oi_h-n_43 2015-03-08 05:25:49,220 ERROR [solr.update.UpdateLog] - [Thread-136] - : Error inspecting tlog tlog{file=/opt/solr/cores/oi_c/data/tlog/tlog.0001394 refcount=2} java.nio.channels.ClosedChannelException at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:99) at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:679) at org.apache.solr.update.ChannelFastInputStream.readWrappedStream(TransactionLog.java:784) at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89) at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:125) at java.io.InputStream.read(InputStream.java:101) at org.apache.solr.update.TransactionLog.endsWithCommit(TransactionLog.java:218) at org.apache.solr.update.UpdateLog.recoverFromLog(UpdateLog.java:800) at org.apache.solr.cloud.ZkController.register(ZkController.java:841) at org.apache.solr.cloud.ZkController$1.command(ZkController.java:277) at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166) 2015-03-08 05:25:49,225 ERROR [solr.update.UpdateLog] - [Thread-136] - : Error inspecting tlog tlog{file=/opt/solr/cores/oi_c/data/tlog/tlog.0001471 refcount=2} java.nio.channels.ClosedChannelException at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:99) at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:679) at org.apache.solr.update.ChannelFastInputStream.readWrappedStream(TransactionLog.java:784) at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89) at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:125) at java.io.InputStream.read(InputStream.java:101) at org.apache.solr.update.TransactionLog.endsWithCommit(TransactionLog.java:218) at org.apache.solr.update.UpdateLog.recoverFromLog(UpdateLog.java:800) at org.apache.solr.cloud.ZkController.register(ZkController.java:841) at org.apache.solr.cloud.ZkController$1.command(ZkController.java:277) at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166) 2015-03-08 12:21:04,438 WARN [solr.cloud.RecoveryStrategy] - [zkCallback-2-thread-28] - : Stopping recovery for core=oi_h coreNodeName=178.21.116.225:8080_solr_oi_h The other node makes a mess in the logs: 2015-03-08 05:25:46,020 WARN [solr.cloud.RecoveryStrategy] - [zkCallback-2-thread-20] - : Stopping recovery for core=oi_c coreNodeName=194.145.201.190:8080_solr_oi_c 2015-03-08 05:26:08,670 ERROR [solr.cloud.ShardLeaderElectionContext] - [zkCallback-2-thread-19] - : There was a problem trying to register as the leader: org.apache.solr.common.SolrException: Could not register as the leader because creating the ephemeral registration node in ZooKeeper failed at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:146) at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:317) at
Re: Using tmpfs for Solr index
I don't have formal benchmarks, but we did get significant performance gains by switching from a RAMDirectory to an MMapDirectory on tmpfs, especially under parallel queries. Locking seemed to pull down the former. On 23 Jan 2015 06:35, deniz denizdurmu...@gmail.com wrote: Would it boost performance if the index were switched from RAMDirectoryFactory to tmpfs? Or would it simply do the same thing as MMap? And in case it is better to use tmpfs rather than RAMDirectory or MMap, which directory factory would be the most feasible one for this purpose? Regards, - Zeki ama calismiyor... Calissa yapar... -- View this message in context: http://lucene.472066.n3.nabble.com/Using-tmpfs-for-Solr-index-tp4181399.html Sent from the Solr - User mailing list archive at Nabble.com.
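The switch described above (RAMDirectory to MMapDirectory on tmpfs) can be sketched as follows. This is a sketch, not a recipe: the mount point, size, and core paths are illustrative, and a tmpfs index disappears on reboot, so you need a persistent copy elsewhere.

```shell
# Assumption: paths and sizes are placeholders; adapt to your install.
# 1) Mount a tmpfs large enough for the index (RAM-backed, lost on reboot):
sudo mkdir -p /mnt/solr-index
sudo mount -t tmpfs -o size=32g tmpfs /mnt/solr-index

# 2) Copy (or re-index) the core's data directory onto it:
cp -r /var/solr/data/mycore/data /mnt/solr-index/mycore

# 3) In the core's solrconfig.xml, use mmap and point dataDir at the tmpfs:
#    <directoryFactory name="DirectoryFactory"
#                      class="solr.MMapDirectoryFactory"/>
#    <dataDir>/mnt/solr-index/mycore</dataDir>
```

Unlike RAMDirectoryFactory, the OS page cache serves concurrent reads of the mmapped files without the per-instance locking the post mentions.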
Re: Solr Recovery process
https://issues.apache.org/jira/browse/SOLR-6359 has a patch which allows this to be configured; it has not gone in as yet. Note that the current design of the UpdateLog causes it to be less efficient if the number is bumped up too much, but it is certainly worth experimenting with. On 22 Jan 2015 02:47, Nishanth S nishanth.2...@gmail.com wrote: Thank you Shalin. So in a system where the indexing rate is more than 5K TPS or so, the replica will never be able to recover through the peer sync process. In my case I have mostly seen step 3, where a full copy happens, and if the index size is huge it takes a very long time for replicas to recover. Is there a way we can configure the number of missed updates for peer sync? Thanks, Nishanth On Wed, Jan 21, 2015 at 4:47 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Hi Nishanth, The recovery happens as follows: 1. PeerSync is attempted first. If the number of new updates on the leader is less than 100 then the missing documents are fetched directly and indexed locally. The tlog tells us the last 100 updates very quickly. Other uses of the tlog are durability of updates and, of course, startup recovery. 2. If the above step fails then replication recovery is attempted. A hard commit is called on the leader and then the leader is polled for the latest index version and generation. If the leader's version and generation are greater than the local index's version/generation then the difference of the index files between leader and replica is fetched and installed. 3. If the above fails (because the leader's version/generation is somehow equal or more than local) then a full index recovery happens and the entire index from the leader is fetched and installed locally. There are some other details involved in this process too but probably not worth going into here.
On Wed, Jan 21, 2015 at 5:13 PM, Nishanth S nishanth.2...@gmail.com wrote: Hello Everyone, I am hitting a few issues with solr replicas going into recovery and then doing a full index copy. I am trying to understand the solr recovery process. I have read a few blogs on this and saw that when the leader notifies a replica to recover (in my case it is due to connection resets) it will try to do a peer sync first, and if the missed updates are more than 100 it will do a full index copy from the leader. I am trying to understand what peer sync is and where the tlog comes into the picture. Are tlogs replayed only during server restart? Can someone help me with this? Thanks, Nishanth -- Regards, Shalin Shekhar Mangar.
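The 100-update PeerSync window discussed above is what the SOLR-6359 patch makes configurable. A sketch of what the patched update log configuration looks like in solrconfig.xml (parameter names and values are those proposed in the patch, which had not been released at the time of this thread):

```xml
<!-- solrconfig.xml sketch, based on the SOLR-6359 patch (unreleased). -->
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <!-- keep more records so PeerSync can bridge a larger gap (default 100) -->
    <int name="numRecordsToKeep">1000</int>
    <!-- cap the number of tlog files retained (default 10) -->
    <int name="maxNumLogsToKeep">20</int>
  </updateLog>
</updateHandler>
```

As the reply notes, raising these makes tlog replay and inspection more expensive, so the values are a trade-off to benchmark, not a free lunch.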
Re: Easiest way to embed solr in a desktop application
That's correct; even though it should still be possible to embed Jetty, that could change in the future, which is why support for pluggable containers is being taken away. If you need to deal with the index at a lower level, there's always Lucene, which you can use as a library instead of Solr. But I am assuming you need to use the search engine at a higher level than that, and hence you ask for Solr. In that case, I urge you to think through whether you really can't run this out of process; maybe this is an XY problem. Keep in mind that Solr can provide higher-level functionality because it controls almost the entirety of the application (which is the philosophical reason behind removal of the war as well), and that's the reason something like EmbeddedSolrServer will always have caveats. On 15 Jan 2015 15:09, Robert Krüger krue...@lesspain.de wrote: I was considering the programmatic Jetty option, but then I read that Solr 5 no longer supports being run with an external servlet container; maybe they still support programmatic Jetty use in some way. At the moment I am using Solr 4.x, so this would work. No idea if this gets messy classloader-wise in any way. I have used exactly the approach you described in the past, i.e. I built a really, really simple Swing dialogue to input queries and display results in a table, but was just guessing that the built-in UI was far superior; maybe I should just live with it for the time being. On Thu, Jan 15, 2015 at 3:56 PM, Erik Hatcher erik.hatc...@gmail.com wrote: It’d certainly be easiest to just embed Jetty into your application. You don’t need to have Jetty as a separate process; you could launch it through its friendly Java API, configured to use solr.war.
If all you needed was to make HTTP(-like) queries to Solr instead of the full admin UI, your application could stick to using EmbeddedSolrServer and also provide a UI that takes in a Solr query string (or builds one up) and then sends it to the embedded Solr and displays the result. Erik On Jan 15, 2015, at 9:44 AM, Robert Krüger krue...@lesspain.de wrote: Hi Andrea, you are assuming correctly. It is a local, non-distributed index that is only accessed by the containing desktop application. Do you know if there is a possibility to run the Solr admin UI on top of an embedded instance somehow? Thanks a lot, Robert On Thu, Jan 15, 2015 at 3:17 PM, Andrea Gazzarini a.gazzar...@gmail.com wrote: Hi Robert, I've used the EmbeddedSolrServer in a scenario like that and I never had problems. I assume you're talking about a standalone application, where the whole index resides locally and you don't need any cluster / cloud / distributed feature. I think the usage of EmbeddedSolrServer is discouraged in a (distributed) service scenario, because it is a direct connection to a SolrCore instance...but this is not a problem in the situation you described (as far as I know) Best, Andrea On 01/15/2015 03:10 PM, Robert Krüger wrote: Hi, I have been using an embedded instance of solr in my desktop application for a long time and it works fine. At the time when I made that decision (vs. firing up a solr web application within my swing application) I got the impression embedded use is somewhat unsupported and I should expect problems. My first question is, is this still the case now (4 years later), that embedded solr is discouraged? The one limitation I am running into is that I cannot use the solr admin UI for debugging purposes (mainly for running queries). Is there any other way to do this other than no longer using embedded solr and programmatically firing up a web application (e.g. using jetty)? Should I do the latter anyway? Any insights/advice greatly appreciated. 
Best regards, Robert -- Robert Krüger Managing Partner Lesspain GmbH Co. KG www.lesspain-software.com -- Robert Krüger Managing Partner Lesspain GmbH Co. KG www.lesspain-software.com
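The EmbeddedSolrServer approach discussed in this thread looks roughly like the sketch below (Solr 4.x SolrJ API; the solr home path and core name are placeholders, and solr-core plus solrj must be on the classpath). The same query string the admin UI would accept can be fed to query(), which is how a small Swing "query console" like Erik describes can be wired up.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.core.CoreContainer;

public class DesktopSearch {
    public static void main(String[] args) throws Exception {
        // Placeholder paths/names; point at your bundled solr home and core.
        CoreContainer container = new CoreContainer("/path/to/solr/home");
        container.load();
        EmbeddedSolrServer server = new EmbeddedSolrServer(container, "collection1");
        try {
            QueryResponse rsp = server.query(new SolrQuery("title:foo"));
            System.out.println(rsp.getResults().getNumFound() + " hits");
        } finally {
            server.shutdown();  // releases the cores held by the container
        }
    }
}
```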
Re: Solr startup script in version 4.10.3
Versions 4.10.3 and beyond already use server rather than example, which is still referenced in the script purely for back-compat. A major release, 5.0, is coming soon; perhaps the back-compat can be removed for that. On 6 Jan 2015 09:30, Dominique Bejean dominique.bej...@eolya.fr wrote: Hi, In release 4.10.3, the following lines were removed from the solr starting script (bin/solr) # TODO: see SOLR-3619, need to support server or example # depending on the version of Solr if [ -e $SOLR_TIP/server/start.jar ]; then DEFAULT_SERVER_DIR=$SOLR_TIP/server else DEFAULT_SERVER_DIR=$SOLR_TIP/example fi However, the usage message still says -d dir Specify the Solr server directory; defaults to server Either the usage has to be fixed or the removed lines put back into the script. Personally, I like defaulting to the server directory. My installation process, in order to have a clean empty solr instance, is to copy example into server and remove directories like example-DIH, example-schemaless, multicore and solr/collection1. The Solr server (or node) can then be started without the -d parameter. If this makes sense, a Jira issue could be opened. Dominique http://www.eolya.fr/
Re: Dealing with bad apples in a SolrCloud cluster
As Eric mentions, his change to add a state where indexing happens but querying doesn't surely helps in this case. But these are still boolean decisions of send vs. don't send. In general, it would be nice to abstract the routing policy so that it is pluggable. You could then do things like have a least-pending policy for choosing replicas -- instead of choosing a replica at random, you maintain a pending response count, and you always send to the one with the least pending (or randomly amongst a set of replicas if there is a tie). Also, the chance your distrib=false case will be hit is actually 1/5 (or something like that, I have forgotten my probability theory), because you have two shards and you get two chances at hitting the bad apple. This was one of the reasons we got SOLR-6730 in, to use replica and host affinity. Under good enough load, the load distribution will more or less be the same with this change, but the chances of hitting bad apples will be lower. On 21 Nov 2014 18:56, Timothy Potter thelabd...@gmail.com wrote: Just soliciting some advice from the community ... Let's say I have a 10-node SolrCloud cluster and a single collection with 2 shards and replication factor 10, so basically each shard has one replica on each of my nodes. Now imagine one of those nodes gets into a bad state and starts to be slow about serving queries (not bad enough to crash outright though) ... I'm sure we could ponder any number of ways a box might slow down without crashing. From my calculations, about 2/10ths of the queries will now be affected, since 1/10 queries from client apps will hit the bad apple + 1/10 queries from other replicas will hit the bad apple (distrib=false). If QPS is high enough and the bad apple is slow enough, things can start to get out of control pretty fast, esp. since we've set max threads so high to avoid distributed dead-lock. What have others done to mitigate this risk? Anything we can do in Solr to help deal with this?
It seems reasonable that nodes could identify a bad apple by keeping track of query times and looking for nodes that are significantly outside (≥ 2 stddev) what the other nodes are doing. Then maybe mark the node as down in ZooKeeper so clients and other nodes stop trying to send requests to it; or maybe a simple policy of just not sending requests to that node for a few minutes.
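The least-pending policy sketched in the reply above can be captured in a few lines. This is a hypothetical illustration of the idea, not Solr's actual routing code: track in-flight requests per replica and always pick the replica with the fewest.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical "least pending" replica chooser, as described above:
 *  route each request to the replica with the fewest in-flight requests. */
public class LeastPendingPolicy {
    private final Map<String, AtomicInteger> pending = new ConcurrentHashMap<>();

    /** Pick the replica with the lowest in-flight count (first wins ties). */
    public String choose(List<String> replicas) {
        String best = null;
        int bestCount = Integer.MAX_VALUE;
        for (String r : replicas) {
            int c = pending.computeIfAbsent(r, k -> new AtomicInteger()).get();
            if (c < bestCount) { bestCount = c; best = r; }
        }
        pending.get(best).incrementAndGet();   // request is now in flight
        return best;
    }

    /** Call when the response (or timeout) arrives. */
    public void complete(String replica) {
        pending.get(replica).decrementAndGet();
    }
}
```

A slow "bad apple" accumulates pending responses and naturally stops receiving traffic, without any explicit down-marking in ZooKeeper.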
Re: any difference between using collection vs. shard in URL?
Do keep one thing in mind though. If you are already doing the work of figuring out the right shard leader (through SolrJ or otherwise), using that location with just the collection name might be suboptimal if there are multiple shard leaders present in the same instance -- the collection name just goes to *some* shard leader and not necessarily the one where your document is destined. If it chooses the wrong one, it will lead to an HTTP request to itself. On 5 Nov 2014 15:33, Shalin Shekhar Mangar shalinman...@gmail.com wrote: There's no difference between the two. Even if you send updates to a shard url, it will still be forwarded to the right shard leader according to the hash of the id (assuming you're using the default compositeId router). Of course, if you happen to hit the right shard leader then it is just an internal forward and not an extra network hop. The advantage of using the collection name is that you can hit any SolrCloud node (even ones not hosting this collection) and it will still work. So for a non-Java client, a load balancer can be set up in front of the entire cluster and things will just work. On Wed, Nov 5, 2014 at 8:50 PM, Ian Rose ianr...@fullstory.com wrote: If I add some documents to a SolrCloud shard in a collection alpha, I can post them to /solr/alpha/update. However I notice that you can also post them using the shard name, e.g. /solr/alpha_shard4_replica1/update - in fact this is what Solr seems to do internally (like if you send documents to the wrong node so Solr needs to forward them over to the leader of the correct shard). Assuming you *do* always post your documents to the correct shard, is there any difference between these two, performance or otherwise? Thanks! - Ian -- Regards, Shalin Shekhar Mangar.
Re: Sharding configuration
On 30 Oct 2014 23:46, Erick Erickson erickerick...@gmail.com wrote: This configuration deals with all the replication, NRT processing, self-repair when nodes go up and down and all that, but since there's no second trip to get the docs from shards, your query performance won't be affected. More or less. I vaguely recall that you would still need to add a shortCircuit parameter to the URL in such a case to avoid a second trip. I might be wrong here, but I do recall wondering why that wasn't the default. And using SolrCloud with a single shard will essentially scale linearly as you add nodes for queries. Best, Erick On Thu, Oct 30, 2014 at 8:29 AM, Anca Kopetz anca.kop...@kelkoo.com wrote: Hi, You are right, it is a mistake in my phrase; for the tests with 4 shards / 4 instances, the latency was worse (therefore *bigger*) than for the tests with one shard. In our case, the query rate is high. Thanks, Anca On 10/30/2014 03:48 PM, Shawn Heisey wrote: On 10/30/2014 4:32 AM, Anca Kopetz wrote: We did some tests with 4 shards / 4 different tomcat instances on the same server and the average latency was smaller than the one when having only one shard. We also tested shards spread over different servers and the performance results were also worse. It seems that sharding does not make any difference for our index in terms of latency gains. That statement is confusing, because if latency goes down, that's good, not worse. If you're going to put multiple shards on one server, it should be done with one solr/tomcat instance, not multiple. One instance is perfectly capable of dealing with many shards, and has a lot less overhead. The SolrCloud collection create command would need the maxShardsPerNode parameter. In order to see a gain in performance from multiple shards per server, the server must have a lot of CPUs and the query rate must be fairly low.
If the query rate is high, then all the CPUs will be busy just handling simultaneous queries, so putting multiple shards per server will probably slow things down. When the query rate is low, multiple CPUs can handle each shard query simultaneously, speeding up the overall query. Thanks, Shawn Kelkoo SAS Société par Actions Simplifiée Au capital de € 4.168.964,30 Siège social : 8, rue du Sentier 75002 Paris 425 093 069 RCS Paris This message and its attachments are confidential and intended exclusively for their addressees. If you are not the intended recipient of this message, please delete it and notify the sender.
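The "one instance, many shards" layout Shawn recommends is what the maxShardsPerNode parameter of the Collections API enables. A sketch of the create call (host, collection name, and counts are illustrative):

```shell
# Hypothetical host/collection names. Creates 4 shards hosted by a single
# Solr instance by raising maxShardsPerNode (the default of 1 would refuse):
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=test&numShards=4&replicationFactor=1&maxShardsPerNode=4"
```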
Re: Sharding configuration
On 30 Oct 2014 14:49, Shawn Heisey apa...@elyograg.org wrote: In order to see a gain in performance from multiple shards per server, the server must have a lot of CPUs and the query rate must be fairly low. If the query rate is high, then all the CPUs will be busy just handling simultaneous queries, so putting multiple shards per server will probably slow things down. When the query rate is low, multiple CPUs can handle each shard query simultaneously, speeding up the overall query. Except that your query latency isn't always CPU bound; there's a significant IO-bound portion as well. I wouldn't go so far as to say that with large query volumes you shouldn't use multiple shards -- it finally comes down to how many shards a machine can handle under peak load, and that could depend on CPU/IO/GC pressure. We have multiple shards on a machine under heavy query load, for example. The only real way is to benchmark this and see. Thanks, Shawn
Re: Sharding configuration
As far as the second option goes, unless you are using a large amount of memory and you reach a point where a JVM can't sensibly deal with the GC load, having multiple JVMs wouldn't buy you much. With a 26GB index, you probably haven't reached that point. There are also other shared resources at an instance level, like connection pools and ZK connections, but those are tunable and you probably aren't pushing them either (I would imagine you are trying to have only a handful of shards, given that you aren't sharded at all currently). That leaves single vs. multiple machines. Assuming the network isn't a bottleneck, and given the same amount of resources overall (number of cores, amount of memory, IO bandwidth times number of machines), it shouldn't matter between the two. If you are procuring new hardware, I would say buy more, smaller machines, but if you already have the hardware, you could serve as much as possible off a machine before moving to a second. There's nothing which limits the number of shards as long as the underlying machine has a sufficient amount of parallelism. Again, this advice is for a small number of shards; if you had a lot more (hundreds of) shards and a significant volume of requests, things start to become a bit more fuzzy, with other limits kicking in. On 28 Oct 2014 09:26, Anca Kopetz anca.kop...@kelkoo.com wrote: Hi, We have a SolrCloud configuration of 10 servers, no sharding, 20 million documents; the index is 26 GB. As the number of documents has increased recently, the performance of the cluster has decreased. We thought of sharding the index, in order to measure the latency. What is the best approach? - use shard splitting and have several sub-shards on the same server and in the same tomcat instance - have several shards on the same server but in different tomcat instances - have one shard on each server (for example 2 shards / 5 replicas on 10 servers) What's the impact of these 3 configurations on performance?
Thanks, Anca
Re: Advice on highlighting
https://issues.apache.org/jira/plugins/servlet/mobile#issue/LUCENE-2878 provides a Lucene API for what you are trying to do; it's not yet in, though. There's a fork which has the change: https://github.com/flaxsearch/lucene-solr-intervals On 12 Sep 2014 21:24, Craig Longman clong...@iconect.com wrote: In order to take our Solr usage to the next step, we really need to improve its highlighting abilities. What I'm trying to do is to write a new component that can return the fields that matched the search (including numeric fields) and the start/end positions for the alphanumeric matches. I see three different approaches to take; each will require making some modifications to the lucene/solr parts, as it just does not appear to be doable as a completely standalone component. 1) At initial search time. This seemed like a good approach. I can follow IndexSearcher creating the TermContext that parses through AtomicReaderContexts to see if it contains a match and then adds it to the contexts available for later. However, at this point, inside SegmentTermsEnum.seekExact(), it seems like Solr is not really looking for matching terms as such; it's just scanning what looks like the raw index. So I don't think I can easily extract term positions at this point. 2) Write a modified HighlighterComponent. We have managed to get phrases to highlight properly, but it seems like getting the full field matches would be more difficult in this module. However, because it does its highlighting oblivious to any other criteria, we can't use it as is. For example, this search: (body:large+AND+user_id:7)+OR+user_id:346 will highlight large in records that have user_id = 346 when technically (for our purposes at least) it should not be considered a hit, because the large was accompanied by the user_id = 7 criterion. It's not immediately clear to me how difficult it would be to change this.
3) Make a modified DebugComponent and enhance the existing explain() methods (in the query types we require it for, at least) to include more information, such as the start/end positions of the term that was hit. I'm exploring this now, but I don't easily see how I can figure out what those positions might be from the explain() information. Any pointers on how, at the point that TermQuery.explain() is being called, I can figure out which indexed token was the actual hit? Craig Longman C++ Developer iCONECT Development, LLC 519-645-1663 This message and any attachments are intended only for the use of the addressee and may contain information that is privileged and confidential. If the reader of the message is not the intended recipient or an authorized representative of the intended recipient, you are hereby notified that any dissemination of this communication is strictly prohibited. If you have received this communication in error, notify the sender immediately by return email and delete the message and any attachments from your system.
Re: Scaling to large Number of Collections
On 31 Aug 2014 13:24, Mark Miller markrmil...@gmail.com wrote: On Aug 31, 2014, at 4:04 AM, Christoph Schmidt christoph.schm...@moresophy.de wrote: we see at least two problems when scaling to a large number of collections. I would like to ask the community if they are known and maybe already addressed in development. We have a SolrCloud running with the following numbers: - 5 servers (each 24 CPUs, 128 GB RAM) - 13,000 collections with 25,000 SolrCores in the cloud. The cloud is working fine, but we see two problems if we want to scale further: 1. Resource consumption of native system threads. We see that each collection opens at least two threads: one for ZooKeeper (coreZkRegister-1-thread-5154) and one for the searcher (searcherExecutor-28357-thread-1). We will run into OutOfMemoryError: unable to create new native thread. Maybe the architecture could be changed here to use thread pools? 2. The shutdown and the startup of one server in the SolrCloud takes 2 hours, so a rolling restart is about 10h. For me the problem seems to be that leader election is linear: the Overseer goes core by core; the organisation of the cloud is not done in parallel or distributed. Is this already addressed by https://issues.apache.org/jira/browse/SOLR-5473 or is more needed? 2. No, but it should have been fixed by another issue that will be in 4.10. Note however that this fix will result in even more temporary thread usage, as all leadership elections will happen in parallel, so you might still end up with these out-of-threads issues. Quite possibly the out-of-threads issue is just some system soft limit kicking in. Linux certainly has a limit you can configure through sysctl; your OS, whatever that might be, probably does the same. It may be worth exploring whether you can bump that up. - Mark http://about.me/markrmiller
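The soft limits Mark refers to can be inspected and raised like this on Linux (the limits.conf values are illustrative; "solr" is a placeholder user name):

```shell
# Per-user process/thread soft limit (native threads count against nproc):
ulimit -u
# System-wide thread ceiling:
cat /proc/sys/kernel/threads-max
# To raise the per-user limit persistently, add lines like these to
# /etc/security/limits.conf (values are examples):
#   solr  soft  nproc  65536
#   solr  hard  nproc  65536
```

"unable to create new native thread" OutOfMemoryErrors are usually this limit rather than actual heap exhaustion, which matches Mark's diagnosis.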
Re: Why does CLUSTERSTATUS return different information than the web cloud view?
ZK has the list of live nodes available as a set of ephemeral nodes. You can use /zookeeper on Solr or talk to ZK directly to get that list. On 24 Aug 2014 03:08, Nathan Neulinger nn...@neulinger.org wrote: Is there a way to query the 'live node' state without sending a query to every node myself? i.e. to get the same data that is used for that cloud status screen? -- Nathan On 08/23/2014 06:39 PM, Mark Miller wrote: The state is actually a combo of the state in clusterstate and the live nodes. If the live node is not there, it's gone regardless of the last state it published. - Mark On Aug 23, 2014, at 6:00 PM, Nathan Neulinger nn...@neulinger.org wrote: In particular, a shard being 'active' vs. 'gone'. The web ui is clearly showing the given replicas as being in Gone state when I shut down a server, yet the CLUSTERSTATUS says that each replica has state: active Is there any way to ask it for status that will reflect that the replica is gone? This is with 4.8.0. -- Nathan Nathan Neulinger nn...@neulinger.org Neulinger Consulting (573) 612-1412 -- Nathan Neulinger nn...@neulinger.org Neulinger Consulting (573) 612-1412
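Talking to ZK directly, as suggested above, looks like this with the stock ZooKeeper CLI (the ensemble address is a placeholder, and if Solr was started with a chroot, prefix the path with it):

```shell
# Live nodes are ephemeral znodes under /live_nodes; list them directly:
zkCli.sh -server localhost:2181 ls /live_nodes
# Expected shape of the output (node names vary):
# [10.0.0.1:8983_solr, 10.0.0.2:8983_solr]
```

A node missing from this list is "gone" regardless of the last state it published to clusterstate.json, which is exactly the combination Mark describes.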
Re: Disabling transaction logs
(1) sounds a lot like SOLR-6261, which I mention above. There are possibly other improvements since 4.6.1, as Mark mentions; I would certainly suggest you test the latest version with the issue above patched (or use the current stable branch in svn, branch_4x) to see if that makes a difference.
Re: Disabling transaction logs
I didn't realise you could even disable the tlog when running SolrCloud, but as Anshum says, it's a bad idea. In all possibility, even if it worked, removing transaction logs is likely to make your restarts slower: SolrCloud would always be forced to do a full recovery, because it could no longer use tlogs for the faster recovery it tries first. Your problem is probably elsewhere. How many replicas do you have? Do you see this problem always, or when you bounce the leaders? SOLR-6261 recently sped that up and is scheduled to roll out with the next release, but you can try out the patch meanwhile. Hello, I am using solr 4.6.1 with over 1000 collections and 8 nodes. Restarting of nodes takes a long time (especially if we have indexing running against it). I want to see if disabling transaction logs can help with a more robust restart. However, I can't see any docs about disabling txn logs in SolrCloud. Can anyone help with info on how to disable transaction logs? Thanks Nitin
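For reference, the transaction log is enabled by the updateLog element in solrconfig.xml; "disabling" it would mean removing or commenting out that element. A sketch of the stock configuration (shown only to locate it, since as the thread explains, removing it breaks SolrCloud recovery and should not be done):

```xml
<!-- solrconfig.xml: the <updateLog> element is what enables the tlog.
     Commenting it out disables it, but SolrCloud relies on it for
     peer sync, leader election, and durability -- leave it in place. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
</updateHandler>
```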
Re: SolrCloud without NRT and indexing only on the master
I agree with Erick that the gain you are looking at might not be worth it, so do measure and see if there's a difference. Also, the next release of Solr is to have some significant improvements when it comes to CPU usage under heavy indexing load, and we have had at least one anecdote so far where the throughput has increased by an order of magnitude, so one option might be to try that out as well. See SOLR-6136 and potentially SOLR-6259 (probably less so; it depends on your schema) if you want to try them out before the release. Another option is to use the HDFS directory support in Solr. That way you can build indices offline and make them available to all your Solr replicas for search. See batch indexing at http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_introducing.html On 30 Jul 2014 11:54, Harald Kirsch harald.kir...@raytion.com wrote: Hi Daniel, well, I assume there is a performance difference on host B between a) getting some ready-made segments from host A (master, taking care of indexing) to host B (slave, taking care of answering queries) and b) host B (along with host A) doing all the work necessary to turn incoming SolrDocument objects into a segment and make it searchable. I am talking here about a setup where during peak loads the CPUs on host B are sweating at 80%, and I assume the following: i) Indexing will draw more than 20% CPU, thereby competing with query answering. ii) Merely copying finished segments to the query-answering node will not draw more than 20% CPU and will thereby not compete with query answering. Index consistency is not an issue, because the number of documents and the number of different, hard-to-get-at sources we will be indexing will always be out-of-sync with the index. Adding an hour or two here is the least of my problems. Harald.
On 30.07.2014 11:58, Daniel Collins wrote: Working backwards slightly, what do you think SolrCloud is going to give you, apart from the consistency of the index (which you want to turn off)? What are all the other benefits of SolrCloud, if you are querying separate instances that aren't guaranteed to be in sync (since you want to use traditional-style master-slave for indexing)? And secondly, why don't you want to use SolrCloud for indexing everywhere? Again, what do you think the master-slave methodology gains you? You have said you want all the resources of the slaves to be for querying, which makes sense, but the slaves have to get the new updates somehow, surely? Whether that is from SolrCloud directly or via master-slave replication, the work has to be done at some point. If you don't have NRT, and you set your commit frequency to something reasonably large, then I don't see the cost of SolrCloud, but I guess it depends on the frequency of your updates. On 30 July 2014 08:22, Harald Kirsch harald.kir...@raytion.com wrote: Thanks Erick, for the confirmation. You say traditional but the docs call it legacy. Not being a native speaker, I might misinterpret the meaning slightly, but to me it conveys the notion of don't use this stuff if you don't have to. SolrCloud indexes to all nodes all the time, there's no real way to turn that off. Which is really a pity when only query load must be scaled and NRT is not necessary. :-/ Harald. On 29.07.2014 18:16, Erick Erickson wrote: bq: What if I don't need NRT and in particular want the slave to use all resources for query answering, i.e. only the master shall index. But at the same time I want all the other benefits of SolrCloud. You want all the benefits of SolrCloud without... using SolrCloud? Your only two choices are traditional master/slave or SolrCloud. SolrCloud indexes to all nodes all the time; there's no real way to turn that off.
You _can_ control the frequency of commits, but you can't turn off the indexing to all the nodes. FWIW, Erick On Tue, Jul 29, 2014 at 5:41 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: I never did it myself, but I always liked this approach: http://lucene.472066.n3.nabble.com/Best-practice-for-rebuild-index-in-SolrCloud-td4054574.html From time to time such recipes are mentioned on the list. On Tue, Jul 29, 2014 at 12:39 PM, Harald Kirsch harald.kir...@raytion.com wrote: Hi all, from the Solr documentation I find two options for how replication of an index is handled: a) SolrCloud indexes on master and all slaves in parallel to support NRT (near-realtime search) b) Legacy replication, where only the master does the indexing and slaves receive index copies once in a while. What if I don't need NRT and in particular want the slave to use all resources for query answering, i.e. only the master shall index, but at the same time I want all the other benefits of SolrCloud? Is this setup possible? Is it somewhere described in the docs? Thanks, Harald.
Re: Anybody knows of a project that indexes SVN repos into Solr?
Not an exact answer.. OpenGrok uses Lucene, but not Solr. On 2 Jun 2014 07:48, Alexandre Rafalovitch arafa...@gmail.com wrote: Hello, Does anybody know of any recent projects that index SVN repos for Solr search? With or without a UI. I know of similar efforts for other VCS, but the only thing I found for SVN is from 2010 and looks quiet. Regards, Alex. P.s. This could also be a cool show-off project for somebody. Plenty of SVN repos around to use as a data source. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
Re: Distributed Search in Solr with different queries per shard
I agree with Eric that this is premature unless you can show that it makes a difference. Firstly why are you splitting the data into multiple time tiers (one recent, and one all) and then waiting to merge results from all of them? Time tiering is useful when you can do the search separately on both and then pick the one which comes back with full results first (usually will be the recent one but it might not have as many results as you want). The way you are trying to aggregate the data is sharding, where one of the cores doesn't have the data the other one has. So you could just 'optimize' by not having the data present in the historical collection. We have support for custom sharding keys now in Solr, haven't used it personally but that might be worth a shot.. On 21 May 2014 14:57, Avner Levy av...@checkpoint.com wrote: I have 2 cores. One with active data and one with historical data (for documents which were removed from the active one). I want to run Distributed Search on both and get the unified result (as supported by Solr Distributed Search, I'm not using Solr Cloud). My problem is that the query for each core is different. Is there a way to specify different query per core and still let Solr to unify the query results? For example: Active data core query: select all green docs History core query: select all green docs with year=2012 Is there a way to extend the distributed search handler to support such a scenario? Thanks in advance, Avner · One option is to send a unified query to both but then each core will work harder for no reason.
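The merge Avner wants can also be done client-side: send each core its own query (the shared part plus a per-core filter), then combine the responses by score. A minimal sketch of the merge step only, under assumed result shapes — the doc dicts and scores below are made up; a real client would get them from each core's /select response:

```python
# Sketch: client-side merge of results from two cores that were queried
# with different filters. Assumes each result is a dict with "id" and
# "score" (hypothetical shape, not Solr's actual response format).

def merge_results(active, history, rows=10):
    """Merge per-core result lists by descending score, deduping by id
    (a doc that moved from the active to the history core should not
    appear twice)."""
    seen, merged = set(), []
    for doc in sorted(active + history, key=lambda d: d["score"], reverse=True):
        if doc["id"] not in seen:
            seen.add(doc["id"])
            merged.append(doc)
    return merged[:rows]

active = [{"id": "a1", "score": 3.2}, {"id": "a2", "score": 1.1}]
history = [{"id": "h1", "score": 2.5}, {"id": "a2", "score": 0.9}]
print([d["id"] for d in merge_results(active, history)])
```

Note that merging raw scores from two cores is only meaningful if their collections have similar statistics; that caveat applies to Solr's own distributed search as well.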
Re: Can I reconstruct text from tokens?
Sorry, didn't think this through. You're right, still the same problem.. On 16 Apr 2014 17:40, Alexandre Rafalovitch arafa...@gmail.com wrote: Why? I want stored=false, at which point multivalued field is just offset values in the dictionary. Still have to reconstruct from offsets. Or am I missing something? Regards, Alex On 16/04/2014 10:59 pm, Ramkumar R. Aiyengar andyetitmo...@gmail.com wrote: Logically if you tokenize and put the results in a multivalued field, you should be able to get all values in sequence? On 16 Apr 2014 16:51, Alexandre Rafalovitch arafa...@gmail.com wrote: Hello, If I use very basic tokenizers, e.g. space based and no filters, can I reconstruct the text from the tokenized form? So, This is a test - This, is, a, test - This is a test? I know we store enough information, but I don't know internal API enough to know what I should be looking at for reconstruction algorithm. Any hints? The XY problem is that I want to store large amount of very repeatable text into Solr. I want the index to be as small as possible, so thought if I just pre-tokenized, my dictionary will be quite small. And I will be reconstructing some final form anyway. The other option is to just use compressed fields on stored field, but I assume that does not take cross-document efficiencies into account. And, it will be a read-only index after build, so I don't care about updates messing things up. Regards, Alex Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
Re: Can I reconstruct text from tokens?
Logically if you tokenize and put the results in a multivalued field, you should be able to get all values in sequence? On 16 Apr 2014 16:51, Alexandre Rafalovitch arafa...@gmail.com wrote: Hello, If I use very basic tokenizers, e.g. space based and no filters, can I reconstruct the text from the tokenized form? So, This is a test - This, is, a, test - This is a test? I know we store enough information, but I don't know internal API enough to know what I should be looking at for reconstruction algorithm. Any hints? The XY problem is that I want to store large amount of very repeatable text into Solr. I want the index to be as small as possible, so thought if I just pre-tokenized, my dictionary will be quite small. And I will be reconstructing some final form anyway. The other option is to just use compressed fields on stored field, but I assume that does not take cross-document efficiencies into account. And, it will be a read-only index after build, so I don't care about updates messing things up. Regards, Alex Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
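For the whitespace-tokenizer case Alexandre describes, a quick sketch shows both why joining tokens mostly works and exactly where it loses information — which is why he is right that offsets would be needed for a faithful reconstruction. This is plain Python string handling standing in for the analysis chain, not Solr's actual API:

```python
# Sketch: reconstructing text from tokens produced by a plain
# whitespace tokenizer. Round-trips only when the source used single
# spaces; tabs, runs of spaces, and newlines are lost.

def tokenize(text):
    return text.split()

def reconstruct(tokens):
    return " ".join(tokens)

original = "This is a test"
assert reconstruct(tokenize(original)) == original      # round-trips here
# But "This  is a test" (double space) comes back with a single space:
assert reconstruct(tokenize("This  is a test")) != "This  is a test"
```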
Re: svn vs GIT
ant compile / ant -f solr dist / ant test certainly work, I use them with a git working copy. You trying something else? On 14 Apr 2014 19:36, Jeff Wartes jwar...@whitepages.com wrote: I vastly prefer git, but last I checked, (admittedly, some time ago) you couldn't build the project from the git clone. Some of the build scripts assumed some svn commands will work. On 4/12/14, 3:56 PM, Furkan KAMACI furkankam...@gmail.com wrote: Hi Amon; There has been a conversation about it at dev list: http://search-lucene.com/m/PrTmPXyDlv/The+Old+Git+Discussionsubj=Re+The+O ld+Git+Discussion On the other hand you do not need to know SVN to use, develop and contribute to Apache Solr project. You can follow the project at GitHub: https://github.com/apache/lucene-solr Thanks; Furkan KAMACI 2014-04-11 5:12 GMT+03:00 Aman Tandon amantandon...@gmail.com: thanks sir, in that case i need to know about svn as well. Thanks Aman Tandon On Fri, Apr 11, 2014 at 7:26 AM, Alexandre Rafalovitch arafa...@gmail.comwrote: You can find the read-only Git's version of Lucene+Solr source code here: https://github.com/apache/lucene-solr . The SVN preference is Apache Foundation's choice and legacy. Most of the developers' workflows are also around SVN. Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency On Fri, Apr 11, 2014 at 7:48 AM, Aman Tandon amantandon...@gmail.com wrote: Hi, I am new here, i have question in mind that why we are preferring the svn more than git? -- With Regards Aman Tandon
Re: update in SolrCloud through C++ client
If only availability is your concern, you can always keep a list of servers to which your C++ clients will send requests, and round robin amongst them. If one of the servers goes down, you will either not be able to reach it or will get a 500+ error in the HTTP response; you can then take it out of circulation (and probably retry in the background with some kind of a ping every minute or so to these down servers to ascertain if they have come back, and then add them back to the list). This is something SolrJ does currently. This doesn't technically need any Zookeeper interaction. The biggest benefit that SolrJ provides (since 4.6 I think) though is that it finds the shard leader to send an update to using ZK and saves a hop. You can technically do this by retrieving and listening to updates using a C++ ZK client (available) and doing what SolrJ currently does. This would be good; the only drawback though, apart from the effort, is that improvements are still happening in the area of managing clusters and how their state is saved in ZK. These changes might not break your code, but at the same time you might not be able to take advantage of them without additional effort. An alternative approach is to link SolrJ into your C++ client using JNI. This has the added benefit of using the Javabin format for requests, which would have some performance benefits. In short, it comes down to what your performance requirements are. If indexing speed and throughput are not that big a deal, just go with having a list of servers and load balancing amongst the active ones. I would suggest you try this anyway before second-guessing that you do need the optimization. If not, I would probably try the JNI route, and if that fails, use a C ZK client to read the cluster state and use that knowledge to decide where to send requests. On 14 Feb 2014 10:58, neerajp neeraj_star2...@yahoo.com wrote: Hello All, I am using Solr for indexing my data. My client is in C++. 
So I make a cURL request to the Solr server for indexing. Now, I want to use indexing in SolrCloud mode using ZooKeeper for HA. I read the wiki link for SolrCloud (http://wiki.apache.org/solr/SolrCloud). What I understand from the wiki is that we should always check the Solr instance status (up and running) in SolrCloud before making an update request. Can I not send the update request to ZooKeeper and let ZooKeeper forward it to the appropriate replica/leader? In the latter case I need not worry about which servers are up and running before making an indexing request. -- View this message in context: http://lucene.472066.n3.nabble.com/update-in-SolrCloud-through-C-client-tp4117340.html Sent from the Solr - User mailing list archive at Nabble.com.
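The server-list-plus-round-robin approach suggested above can be sketched language-neutrally (Python here; a C++ version would be structurally identical). The send and ping callables are stand-ins for real HTTP requests, and every name is hypothetical:

```python
# Sketch: round-robin over a server list, moving failed servers to a
# "down" set that a periodic background ping re-checks. A real client
# would treat connection errors and 5xx responses as failures.
import itertools

class RoundRobinClient:
    def __init__(self, servers):
        self.servers = list(servers)
        self.down = set()
        self._cycle = itertools.cycle(self.servers)

    def request(self, send):
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if server in self.down:
                continue                 # skip servers out of circulation
            try:
                return send(server)
            except IOError:
                self.down.add(server)    # take it out of circulation
        raise IOError("no live servers reachable")

    def ping_down_servers(self, ping):
        # Called from a background timer: re-add servers that respond.
        for server in list(self.down):
            if ping(server):
                self.down.discard(server)

# tiny demo with a stand-in send function
client = RoundRobinClient(["solr1:8983", "solr2:8983"])
print(client.request(lambda server: "queued via " + server))
```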
Re: SolrCloud Zookeeper disconnection/reconnection
Start with http://wiki.apache.org/solr/SolrPerformanceProblems It has a section on GC tuning and a link to some example settings. On 16 Feb 2014 21:19, lboutros boutr...@gmail.com wrote: Thanks a lot for your answer. Is there a web page, on the wiki for instance, where we could find some JVM settings or recommandations that we should used for Solr with some index configurations? Ludovic. - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-Zookeeper-disconnection-reconnection-tp4117101p4117653.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud Zookeeper disconnection/reconnection
Ludovic, recent Solr changes won't do much to prevent ZK session expiry; you might want to enable GC logging on Solr and Zookeeper to check for pauses and tune appropriately. The patch below fixes a situation under which the cloud can get into a bad state during the recovery after session expiry. The recovery after a session expiry is unavoidable, but as you guessed, it should be quick if there aren't too many updates. 4.6.1 also has SOLR-5577, which will prevent updates from unnecessarily stalling when you are disconnected from ZK for a short while. These changes (and probably others) will thus probably help the cloud behave better on ZK expiry, and for that reason I would encourage you to upgrade, but the ZK expiry problem itself has to be dealt with by ensuring that ZK and Solr don't pause for too long and by choosing an appropriate session timeout (which, btw, will be defaulted up to 30s from 15s in Solr 4.7 onwards). On 13 Feb 2014 08:23, lboutros boutr...@gmail.com wrote: Dear all, we are currently using Solr 4.3.1 in production (with SolrCloud). We encounter much the same problem described in this other old post: http://lucene.472066.n3.nabble.com/SolrCloud-CloudSolrServer-Zookeeper-disconnects-and-re-connects-with-heavy-memory-usage-consumption-td4026421.html Sometimes some nodes are disconnected from Zookeeper and then they try to reconnect. The process is quite long because we have a quite long warming process. And because of this long warming process, just after the recovery process, the node is disconnected again and so on... until an OOM sometimes. We already increased the Zk timeout, but it is not enough. We are thinking of migrating to Solr 4.6.1 at least (perhaps 4.7 will be out before the end of the migration :) ). I know that a lot of SolrCloud bugs have been corrected since Solr 4.3.1. But could we be sure that this problem will be resolved? Or can this problem occur with the latest Solr version? 
(I know this is not an easy question ;) ) It seems that this correction: Deadlock while trying to recover after a ZK session expiry: https://issues.apache.org/jira/browse/SOLR-5615 is a good step in addressing our current problem. But do you think it will be enough? One last thing: I don't know if it is already addressed by a correction, but if there are no updates between the disconnection and the reconnection, the recovery process should not do anything more than the reconnection, I mean: no replication, no tLog replay and no warming process. Is that the case? Ludovic. - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-Zookeeper-disconnection-reconnection-tp4117101.html Sent from the Solr - User mailing list archive at Nabble.com.
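On the session-timeout point: the timeout Ramkumar mentions is zkClientTimeout, set in the solrcloud section of solr.xml (or via the zkClientTimeout system property). A sketch using the 30-second value he cites as the 4.7 default:

```xml
<!-- solr.xml (sketch): 30000 ms matches the default from Solr 4.7
     onwards; raise it further only if GC pauses demand it -- the
     real fix is tuning GC so pauses stay well under this value. -->
<solrcloud>
  <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
</solrcloud>
```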
Re: need help in understating solr cloud stats data
We have had success with starting up Jolokia in the same servlet container as Solr, and then using its REST/Bulk API to JMX from the application of choice. On 4 Feb 2014 17:16, Walter Underwood wun...@wunderwood.org wrote: I agree that sorting and filtering stats in Solr is not a good idea. There is certainly some use in aggregation, though. One request to /admin/mbeans replaces about 50 JMX requests. Is anybody working on https://issues.apache.org/jira/browse/SOLR-4735? wunder On Feb 4, 2014, at 8:13 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: +101 for more stats. Was just saying that trying to pre-aggregate them along multiple dimensions is probably best left out of Solr. Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr Elasticsearch Support * http://sematext.com/ On Tue, Feb 4, 2014 at 10:49 AM, Mark Miller markrmil...@gmail.com wrote: I think that is silly. We can still offer per shard stats *and* let a user easily see stats for a collection without requiring they jump hoops or use a specific monitoring solution where someone else has already jumped hoops for them. You don't have to guess what ops people really want - *everyone* wants stats that make sense for the collections and cluster on top of the per shard stats. *Everyone* wouldn't mind seeing these without having to setup a monitoring solution first. If you want more than that, then you can fiddle with your monitoring solution. - Mark http://about.me/markrmiller On Feb 3, 2014, at 11:10 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, Oh, I just saw Greg's email on dev@ about this. IMHO aggregating in the search engine is not the way to do. Leave that to external tools, which are likely to be more flexible when it comes to this. For example, our SPM for Solr can do all kinds of aggregations and filtering by a number of Solr and SolrCloud-specific dimensions already, without Solr having to do any sort of aggregation that it thinks Ops people will really want. 
Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr Elasticsearch Support * http://sematext.com/ On Mon, Feb 3, 2014 at 11:08 AM, Mark Miller markrmil...@gmail.com wrote: You should contribute that and spread the dev load with others :) We need something like that at some point, it's just no one has done it. We currently expect you to aggregate in the monitoring layer and it's a lot to ask IMO. - Mark http://about.me/markrmiller On Feb 3, 2014, at 10:49 AM, Greg Walters greg.walt...@answers.com wrote: I've had some issues monitoring Solr with the per-core mbeans and ended up writing a custom request handler that gets loaded then registers itself as an mbean. When called it polls all the per-core mbeans then adds or averages them where appropriate before returning the requested value. I'm not sure if there's a better way to get jvm-wide stats via jmx but it is *a* way to get it done. Thanks, Greg On Feb 3, 2014, at 1:33 AM, adfel70 adfe...@gmail.com wrote: I'm sending all solr stats data to graphite. I have some questions: 1. query_handler/select requestTime - if I'm looking at some metric, let's say 75thPcRequestTime - I see that each core in a single collection has different values. Is each value of each core the time that specific core spent on a request? So to get an idea of total request time, should I sum up all the values of all the cores? 2. update_handler/commits - does this include auto_commits? Because I'm pretty sure I'm not doing any manual commits and yet I see a number there. 3. update_handler/docs pending - what does this mean? Pending for what? For flush to disk? Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/need-help-in-understating-solr-cloud-stats-data-tp4114992.html Sent from the Solr - User mailing list archive at Nabble.com. -- Walter Underwood wun...@wunderwood.org
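The add-or-average logic Greg describes, and adfel70's question 1, come down to which statistics compose across cores. A minimal sketch (all field names hypothetical): counters sum cleanly, while a percentile such as 75thPcRequestTime cannot be exactly recombined from per-core percentiles — here the per-core maximum is reported instead, as a rough upper bound:

```python
# Sketch: rolling per-core stats up to collection level. Counters sum;
# percentiles do NOT compose, so we expose the worst core's value.

def aggregate(per_core_stats):
    cores = list(per_core_stats.values())
    return {
        "requests": sum(c["requests"] for c in cores),
        # not a true collection-wide percentile, just an upper bound
        "75thPcRequestTime_max": max(c["75thPcRequestTime"] for c in cores),
    }

stats = {
    "shard1_replica1": {"requests": 1200, "75thPcRequestTime": 35.0},
    "shard2_replica1": {"requests": 800,  "75thPcRequestTime": 50.0},
}
print(aggregate(stats))
```

Computing a true collection-wide percentile would require the underlying latency histograms or raw timings from each core, which is part of why pushing aggregation to a monitoring layer is attractive.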
Re: Removing last replica from a SolrCloud collection
There's already an issue for this: https://issues.apache.org/jira/browse/SOLR-5209. We were once bitten by the same issue when we were trying to relocate a shard. As Mark mentions, the idea was to do this in zk truth mode; the link also references where that work is being done. On 31 Jan 2014 23:10, David Smiley (@MITRE.org) dsmi...@mitre.org wrote: Hi, If I issue either a core UNLOAD command, or a collection DELETEREPLICA command (which both seem pretty much equivalent), it works, but if there are no other replicas for the shard, then the metadata for the shard is completely gone in clusterstate.json! That's pretty disconcerting because you're basically hosed. Of course, why would I even want to do that? Well, I'm experimenting with ways to restore a backed-up replica to replace existing data for the shard. If this is unexpected behavior then I'll file a bug. ~ David - Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Removing-last-replica-from-a-SolrCloud-collection-tp4114772.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr limitations
I understand, thanks. I just wanted to check in case there were scalability limitations with how SolrCloud operates.. On 9 Jul 2013 12:45, Erick Erickson erickerick...@gmail.com wrote: I think Jack was mostly thinking in slam dunk terms. I know of SolrCloud demo clusters with 500+ nodes, and at that point people said it's going to work for our situation, we don't need to push more. As you start getting into that kind of scale, though, you really have a bunch of ops considerations etc. Mostly when I get into larger scales I pretty much want to examine my assumptions and see if they're correct, perhaps start to trim my requirements etc. FWIW, Erick On Tue, Jul 9, 2013 at 4:07 AM, Ramkumar R. Aiyengar andyetitmo...@gmail.com wrote: 5. No more than 32 nodes in your SolrCloud cluster. I hope this isn't too OT, but what tradeoffs is this based on? Would have thought it easy to hit this number for a big index and high load (hence with the view of both the number of shards and replicas horizontally scaling..) 6. Don't return more than 250 results on a query. None of those is a hard limit, but don't go beyond them unless your Proof of Concept testing proves that performance is acceptable for your situation. Start with a simple 4-node, 2-shard, 2-replica cluster for preliminary tests and then scale as needed. Dynamic and multivalued fields? Try to stay away from them - except for the simplest cases, they are usually an indicator of a weak data model. Sure, it's fine to store a relatively small number of values in a multivalued field (say, dozens of values), but be aware that you can't directly access individual values, you can't tell which was matched on a query, and you can't coordinate values between multiple multivalued fields. Except for very simple cases, multivalued fields should be flattened into multiple documents with a parent ID. 
Since you brought up the topic of dynamic fields, I am curious how you got the impression that they were a good technique to use as a starting point. They're fine for prototyping and hacking, and fine when used in moderation, but not when used to excess. The whole point of Solr is searching and searching is optimized within fields, not across fields, so having lots of dynamic fields is counter to the primary strengths of Lucene and Solr. And... schemas with lots of dynamic fields tend to be difficult to maintain. For example, if you wanted to ask a support question here, one of the first things we want to know is what your schema looks like, but with lots of dynamic fields it is not possible to have a simple discussion of what your schema looks like. Sure, there is something called schemaless design (and Solr supports that in 4.4), but that's very different from heavy reliance on dynamic fields in the traditional sense. Schemaless design is A-OK, but using dynamic fields for arrays of data in a single document is a poor match for the search features of Solr (e.g., Edismax searching across multiple fields.) One other tidbit: Although Solr does not enforce naming conventions for field names, and you can put special characters in them, there are plenty of features in Solr, such as the common fl parameter, where field names are expected to adhere to Java naming rules. When people start going wild with dynamic fields, it is common that they start going wild with their names as well, using spaces, colons, slashes, etc. that cannot be parsed in the fl and qf parameters, for example. Please don't go there! In short, put up a small cluster and start doing a Proof of Concept cluster. Stay within my suggested guidelines and you should do okay. 
-- Jack Krupansky -Original Message- From: Marcelo Elias Del Valle Sent: Monday, July 08, 2013 9:46 AM To: solr-user@lucene.apache.org Subject: Solr limitations Hello everyone, I am trying to find information about possible Solr limitations I should consider in my architecture. Things like the max number of dynamic fields, max number of documents in SolrCloud, etc. Does anyone know where I can find this info? Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
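Jack's advice that multivalued fields "should be flattened into multiple documents with a parent ID" can be sketched as a simple pre-indexing transform. The field names below are made up for illustration:

```python
# Sketch: turn one doc with a multivalued field into one child doc per
# value, each carrying a parent_id back to the original record.

def flatten(doc, multi_field):
    parent_id = doc["id"]
    for i, value in enumerate(doc[multi_field]):
        # each child holds a single value, so it can be matched and
        # accessed individually (which a multivalued field can't)
        yield {
            "id": "%s_%d" % (parent_id, i),
            "parent_id": parent_id,
            multi_field: value,
        }

book = {"id": "b1", "author": ["Alice", "Bob"]}
children = list(flatten(book, "author"))
print(children)
```

Queries then run against the child documents, and the parent_id field lets you group or join results back to the original record.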
Re: Solr limitations
5. No more than 32 nodes in your SolrCloud cluster. I hope this isn't too OT, but what tradeoffs is this based on? Would have thought it easy to hit this number for a big index and high load (hence with the view of both the number of shards and replicas horizontally scaling..) 6. Don't return more than 250 results on a query. None of those is a hard limit, but don't go beyond them unless your Proof of Concept testing proves that performance is acceptable for your situation. Start with a simple 4-node, 2-shard, 2-replica cluster for preliminary tests and then scale as needed. Dynamic and multivalued fields? Try to stay away from them - except for the simplest cases, they are usually an indicator of a weak data model. Sure, it's fine to store a relatively small number of values in a multivalued field (say, dozens of values), but be aware that you can't directly access individual values, you can't tell which was matched on a query, and you can't coordinate values between multiple multivalued fields. Except for very simple cases, multivalued fields should be flattened into multiple documents with a parent ID. Since you brought up the topic of dynamic fields, I am curious how you got the impression that they were a good technique to use as a starting point. They're fine for prototyping and hacking, and fine when used in moderation, but not when used to excess. The whole point of Solr is searching and searching is optimized within fields, not across fields, so having lots of dynamic fields is counter to the primary strengths of Lucene and Solr. And... schemas with lots of dynamic fields tend to be difficult to maintain. For example, if you wanted to ask a support question here, one of the first things we want to know is what your schema looks like, but with lots of dynamic fields it is not possible to have a simple discussion of what your schema looks like. 
Sure, there is something called schemaless design (and Solr supports that in 4.4), but that's very different from heavy reliance on dynamic fields in the traditional sense. Schemaless design is A-OK, but using dynamic fields for arrays of data in a single document is a poor match for the search features of Solr (e.g., Edismax searching across multiple fields.) One other tidbit: Although Solr does not enforce naming conventions for field names, and you can put special characters in them, there are plenty of features in Solr, such as the common fl parameter, where field names are expected to adhere to Java naming rules. When people start going wild with dynamic fields, it is common that they start going wild with their names as well, using spaces, colons, slashes, etc. that cannot be parsed in the fl and qf parameters, for example. Please don't go there! In short, put up a small cluster and start doing a Proof of Concept cluster. Stay within my suggested guidelines and you should do okay. -- Jack Krupansky -Original Message- From: Marcelo Elias Del Valle Sent: Monday, July 08, 2013 9:46 AM To: solr-user@lucene.apache.org Subject: Solr limitations Hello everyone, I am trying to find information about possible Solr limitations I should consider in my architecture. Things like the max number of dynamic fields, max number of documents in SolrCloud, etc. Does anyone know where I can find this info? Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
Re: whole index in memory
In general, just increasing the cache sizes to make everything fit in memory might not always give you the best results. Do keep in mind that the caches are in Java memory, and that incurs the penalty of garbage collection and whatever other housekeeping Java's memory management has to do. Reasonably recent Solr distributions default to memory-mapping your collections on most platforms. What that means is that if you have sufficient free memory available on your server for the operating system to use, it will do the caching for you, and that invariably ends up being much better in terms of performance. From that angle, it's preferable to keep the caches as small as possible so that the OS has more to cache. That said, as always, YMMV. The ultimate test in all this is to try it out yourself with various configurations and see the performance differences for yourself. On 1 Jun 2013 01:34, alx...@aim.com wrote: Hello, I have a solr index of size 5GB. I am thinking of increasing the cache size to 5 GB, expecting Solr will put the whole index into memory. 1. Will Solr indeed put the whole index into memory? 2. What are the drawbacks of this approach? Thanks in advance. Alex.
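Keeping Solr's own caches modest so the OS page cache gets the free RAM looks roughly like this in solrconfig.xml — a sketch with illustrative sizes, not tuned recommendations:

```xml
<!-- solrconfig.xml (sketch): small Solr-side caches, leaving RAM free
     for the OS page cache, which serves the memory-mapped index files. -->
<query>
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="64"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512"/>
</query>
```

The trade-off described above: every entry held in these Java-heap caches is memory the JVM must garbage-collect and the OS cannot use for page-caching index files.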