Using Solr from Github or SVN
I want to branch Solr (latest version) locally and implement some custom code. After some time (maybe every month) I will merge my code back with Solr. However, there is Solr code both in SVN and on GitHub, and I see that they are not exactly in sync. Which one do you suggest? If there is no delay between the SVN and GitHub repositories, do you think using Git is much better because merging is easier?
Re: Using Solr from Github or SVN
How about deciding on Maven or Ant + Ivy? On the other hand, I need another suggestion on whether to use Eclipse or IntelliJ IDEA. Which do developers commonly use?

2013/3/21 Jan Høydahl jan@cominvent.com: See http://wiki.apache.org/solr/HowToContribute Whether you choose to work locally with a Git checkout or SVN is up to you. At the end of the day, when you want to contribute stuff back, you'd generate a patch and attach it to JIRA. SVN is the main repo, so if you want to be 100% in sync, choose the official SVN. -- Jan Høydahl, search solution architect, Cominvent AS - www.cominvent.com, Solr Training - www.solrtraining.com

21. mars 2013 kl. 10:31 skrev Furkan KAMACI furkankam...@gmail.com: [...]
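Jan's contribute-back loop can be sketched as shell commands. The SVN URL below is the historical Apache trunk location and the issue number SOLR-XXXX is an illustrative placeholder, not a value from this thread; the actual commands are shown commented because they require a checkout and a JIRA issue.

```shell
# Sketch, under the assumptions above: work from the official SVN, contribute back as a patch.
# svn checkout https://svn.apache.org/repos/asf/lucene/dev/trunk lucene-solr   # official repo (assumed URL)
# cd lucene-solr
# ... make your local changes ...
# svn diff > SOLR-XXXX.patch      # attach this file to the JIRA issue

PATCH_NAME="SOLR-XXXX.patch"      # placeholder issue number
echo "attach $PATCH_NAME to the JIRA issue"
```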
How can I compile and debug Solr from source code?
I use IntelliJ IDEA 12 and Solr 4.1 on a 64-bit CentOS 6.4 computer. I have opened the Solr source code in IntelliJ IDEA as the documentation explains. I want to deploy Solr into Tomcat 7. When I open the project there are run configurations set up previously (I ran the 'ant idea' command before opening the project). However, they are all test configurations, and some of the tests do not pass (that is another issue, no need to go into detail in this e-mail). I have added a Tomcat Local configuration, but I don't know where the main method of Solr is, and whether there is any documentation that explains the code. For example, I want to set a breakpoint to see what Solr receives, and what it does, when I run '-index' from Nutch. I tried to run the code (I don't think I could generate a .war or an exploded folder) and this is the error I get (I didn't point to any artifact in the run configuration): Error: Exception thrown by the agent : java.net.MalformedURLException: Local host name unknown: java.net.UnknownHostException: me.local: me.local: Name or service not known (me.local is the host name I set when I installed CentOS 6.4 on my computer.) Any ideas on how to run the source code would be great.
Re: How can I compile and debug Solr from source code?
Using the embedded server is an option. However, I see that there is a .war file inside the Solr source tree, so that means I can generate a .war file and deploy it to Tomcat or something like that. My main question arises here: how can I generate a .war file from my customized Solr source code? That's why I mentioned Tomcat. Any ideas?

2013/3/21 Shawn Heisey s...@elyograg.org: On 3/21/2013 6:56 AM, Furkan KAMACI wrote: [...] There actually isn't a way to execute Solr itself; it doesn't have a main method. Solr is a servlet, so it requires a servlet container to run. The container it ships with is Jetty. You have mentioned Tomcat. I don't know how you might go about running Tomcat and Solr within IntelliJ. Perhaps someone else here might.
The debugging instructions on the wiki for IntelliJ seem to indicate that you debug remotely and start the included Jetty with some special options: http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ If you do figure out how to get IntelliJ to deploy directly to a locally installed Tomcat, please update the wiki with the steps required. Thanks, Shawn
Re: How can I compile and debug Solr from source code?
Is your suggestion only for the example application? Can I apply it to plain Solr? (I don't want to build just the example application, because my aim is not only debugging Solr; I want to extend it, and I will debug that extended code.)

2013/3/22 Alexandre Rafalovitch arafa...@gmail.com: That's nice. Can we put that on a Wiki? Or as a quick screencast? Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Mar 21, 2013 at 5:42 PM, Erik Hatcher erik.hatc...@gmail.com wrote: Here's my development/debug workflow: - ant idea at the top level to generate the IntelliJ project - cd solr; ant example - to build the full example - cd example; java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005 -jar start.jar - to launch Jetty+Solr in debug mode - set breakpoints in IntelliJ, set up a Remote run option (localhost:5005) in IntelliJ and debug pleasantly. All the unit tests in Solr run very nicely in IntelliJ too, and for tight development loops I spend my time doing that instead of running full-on Solr. Erik

On Mar 21, 2013, at 05:56, Furkan KAMACI wrote: [...]
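Erik's debug workflow above can be restated as a small command sketch. The build commands are shown commented since they need a Solr source checkout; only the JDWP option string (from his mail, debug port 5005) is assembled and printed here.

```shell
# Sketch of the workflow from Erik's mail; run the commented commands at a real checkout.
# ant idea                    # at the repo root: generate the IntelliJ project
# cd solr && ant example      # build the full example
# cd example
DEBUG_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005"
# java $DEBUG_OPTS -jar start.jar   # Jetty+Solr in debug mode; attach an IntelliJ
#                                   # Remote run configuration on localhost:5005
echo "$DEBUG_OPTS"
```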
Re: How can I compile and debug Solr from source code?
What I mean is this: there is a .war file shipped with the Solr source code. How can I build my code and regenerate a .war file like that one, which I will then deploy to Tomcat?

2013/3/22 Furkan KAMACI furkankam...@gmail.com: [...]
Could not load config for solrconfig.xml
I ran the 'ant idea' command for Solr 4.1.0 and opened the source code in IntelliJ IDEA 12.0.4; I use CentOS 6.4 on my 64-bit computer. I debugged JettySolrRunner (I don't know; I think this is the way to run Solr with embedded Jetty from within IntelliJ IDEA). However, I get this error:

SEVERE: Unable to create core: collection1
org.apache.solr.common.SolrException: Could not load config for solrconfig.xml
	at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:897)
	at org.apache.solr.core.CoreContainer.create(CoreContainer.java:957)
	at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:579)
	at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:574)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.IOException: Can't find resource 'solrconfig.xml' in classpath or './collection1/conf/', cwd=/home/kamaci/projects/lucene-solr
	at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:319)
	at org.apache.solr.core.SolrResourceLoader.openConfig(SolrResourceLoader.java:284)
	at org.apache.solr.core.Config.init(Config.java:112)
	at org.apache.solr.core.Config.init(Config.java:82)
	at org.apache.solr.core.SolrConfig.init(SolrConfig.java:117)
	at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:894)
	... 11 more

What should I do?
Re: How can I compile and debug Solr from source code?
OK, I ran that and see that there is a .war file at lucene-solr/solr/dist. Do you know how I can run that Ant target from IntelliJ without the command line (there are many targets under the Ant build window)? On the other hand, within IntelliJ IDEA, how can I auto-deploy it into Tomcat? In other words, can I edit the run configuration so that it runs that Ant command and deploys to Tomcat itself?

2013/3/22 Steve Rowe sar...@gmail.com: Perhaps you didn't see what I wrote earlier? Sounds like you want 'ant dist', which will create the .war and put it into the solr/dist/ directory: PROMPT$ ant dist Steve

On Mar 21, 2013, at 7:38 PM, Furkan KAMACI furkankam...@gmail.com wrote: [...]
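Steve's 'ant dist' step, plus the manual Tomcat deploy that follows from it, can be sketched as below. The Tomcat path and the exact .war file name are assumptions (check solr/dist/ after the build); the real commands are commented since they need a Solr checkout and a Tomcat install.

```shell
# Sketch, assuming a 4.1.0 build and a Tomcat 7 install at the path below.
TOMCAT_HOME="/usr/share/tomcat7"       # assumed install location, adjust for your machine
WAR="solr/dist/solr-4.1.0.war"         # name may differ per build; verify after 'ant dist'
# ant dist                             # run at the solr/ directory of the checkout
# cp "$WAR" "$TOMCAT_HOME/webapps/solr.war"   # Tomcat auto-deploys wars dropped in webapps/
echo "deploy $WAR -> $TOMCAT_HOME/webapps/solr.war"
```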
Re: Could not load config for solrconfig.xml
Should I create a collection1 folder like the one in the example? On the other hand, if I deploy the .war, how can I resolve that problem there too?

2013/3/22 Furkan KAMACI furkankam...@gmail.com: [...]
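One common way past the "Can't find resource 'solrconfig.xml' in classpath or './collection1/conf/'" error above is to point the solr.solr.home system property at a directory that contains collection1/conf/ (the example/solr layout is the usual template). The checkout path below comes from the cwd in the stack trace; treating example/solr as the home directory is an assumption, not something the thread confirms.

```shell
# Sketch: reuse the example's solr home so collection1/conf/solrconfig.xml is found.
SOLR_HOME="/home/kamaci/projects/lucene-solr/solr/example/solr"   # assumed location
JVM_ARG="-Dsolr.solr.home=$SOLR_HOME"
# java $JVM_ARG -jar start.jar    # or add JVM_ARG to the IntelliJ run configuration's VM options
echo "$JVM_ARG"
```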
Using Solr For a Real Search Engine
If I want to use Solr in a web search engine, what kind of strategies should I follow for running Solr? I mean, should I run it via embedded Jetty, or build a war and deploy it to a container? You should consider that I will have a heavy workload on my Solr.
NoSuchMethodError updateDocument
I use Solr 4.1.0 and Nutch 2.1, Java 1.7.0_17, Tomcat 7.0, and IntelliJ IDEA 12 with CentOS 6.4 on my 64-bit computer. I ran this command successfully: bin/nutch solrindex http://localhost:8080/solr -index However, when I run this command: bin/nutch solrindex http://localhost:8080/solr -reindex I get this error:

Mar 22, 2013 6:48:27 PM org.apache.solr.common.SolrException log
SEVERE: null:java.lang.RuntimeException: java.lang.NoSuchMethodError: org.apache.lucene.index.IndexWriter.updateDocument(Lorg/apache/lucene/index/Term;Lorg/apache/lucene/index/IndexDocument;Lorg/apache/lucene/analysis/Analyzer;)V
	at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:653)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:366)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
	at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:936)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
	at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1004)
	at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
	at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.NoSuchMethodError: org.apache.lucene.index.IndexWriter.updateDocument(Lorg/apache/lucene/index/Term;Lorg/apache/lucene/index/IndexDocument;Lorg/apache/lucene/analysis/Analyzer;)V
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:201)
	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:451)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:587)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:346)
	at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
	at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
	at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
	at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1812)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
	... 16 more
Re: NoSuchMethodError updateDocument
I just set this JVM parameter: -Dsolr.solr.home=/home/projects/lucene-solr/solr/solr_home where solr_home is where my config files etc. stand. My solr.xml has these lines:

<cores adminPath="/admin/cores" defaultCoreName="collection1" host="${host:}" hostPort="${jetty.port:}" hostContext="${hostContext:}" zkClientTimeout="${zkClientTimeout:15000}">
  <core name="collection1" instanceDir="collection1/" />
</cores>

On the other hand, I run it from my Tomcat without using the example's embedded Jetty start.jar. Any ideas?

2013/3/22 Furkan KAMACI furkankam...@gmail.com: [...]
Re: NoSuchMethodError updateDocument
Hi Jan; I will check the jar versions. By the way, I think I should create a Solr home directory for my application (my application is this: I use Nutch to crawl web sites and Solr to index them). Which folder from the Solr source tree (maybe lucene-solr/solr/example/example-DIH/solr?) should I copy somewhere and pass its path as the Solr home JVM parameter? And I don't know what extra changes I should make for my situation (Nutch crawling and Solr indexing). In solr.xml there is a field ${jetty.port:} and I didn't define a port for it. I use Tomcat and it runs at 8080, and I think the Jetty port is 8983; that's why I think there may be a confusing point.

2013/3/23 Jan Høydahl jan@cominvent.com: Are you 100% sure you use the exact jars for 4.1.0 *everywhere*, and that you're not blending older versions from the Nutch distro in your classpath here? BTW: What was your question here regarding Jetty vs Tomcat? -- Jan Høydahl, search solution architect, Cominvent AS - www.cominvent.com, Solr Training - www.solrtraining.com

23. mars 2013 kl. 00:50 skrev Furkan KAMACI furkankam...@gmail.com: [...]
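A NoSuchMethodError like the one above usually means two different Lucene versions ended up on the classpath, which is what Jan's question is probing. One quick way to compare, sketched below, is to extract the version suffix from each lucene-*.jar file name in both classpaths; the directory paths in the commented loop are assumptions, adjust for your install.

```shell
# Helper: pull the trailing version out of a jar file name,
# e.g. lucene-core-4.1.0.jar -> 4.1.0
jar_version() {
  basename "$1" .jar | sed 's/.*-\([0-9][0-9.]*\)$/\1/'
}

# Compare what the Solr webapp and Nutch actually load (assumed paths):
# for j in "$TOMCAT_HOME"/webapps/solr/WEB-INF/lib/lucene-*.jar "$NUTCH_HOME"/lib/lucene-*.jar; do
#   echo "$j -> $(jar_version "$j")"
# done

jar_version "lucene-core-4.1.0.jar"
```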
NoSuchMethodError SolrIndexSearcher.doc(I)
I have just configured my Solr to index Nutch crawl data. I ran 'dist-war' for Solr, and when I deploy the war file from IntelliJ IDEA 12.0.4 I get this severe error in my logs:

Mar 23, 2013 7:14:32 PM org.apache.solr.common.SolrException log
SEVERE: null:java.lang.NoSuchMethodError: org.apache.solr.search.SolrIndexSearcher.doc(I)Lorg/apache/lucene/index/StoredDocument;
	at org.apache.solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java:78)
	at org.apache.solr.core.SolrCore$5.call(SolrCore.java:1601)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:722)

However, it is deployed, index and reindex worked, and I can get results from my searches. What may be the reason for this, and is 'dist-war' the right target to compile my changes to Solr when generating the war file?
Re: NoSuchMethodError SolrIndexSearcher.doc(I)
I set -Dsolr.data.dir as a JVM parameter, and the error is gone.

2013/3/23 Furkan KAMACI furkankam...@gmail.com: [...]
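The two JVM properties this thread ended up needing can be combined as below: solr.solr.home (with the path from the earlier mails) and solr.data.dir (whose actual value the thread does not give, so the path here is a placeholder). For Tomcat, CATALINA_OPTS is the standard way to pass such JVM options.

```shell
# Sketch: the solr.solr.home value is from this thread; the data dir is a placeholder.
SOLR_OPTS="-Dsolr.solr.home=/home/projects/lucene-solr/solr/solr_home -Dsolr.data.dir=/path/to/solr/data"
# For Tomcat, export before starting the server:
# export CATALINA_OPTS="$SOLR_OPTS"
# $TOMCAT_HOME/bin/startup.sh
echo "$SOLR_OPTS"
```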
Re: NoSuchMethodError updateDocument
I am using: bin/nutch solrindex http://localhost:8983/solr -index bin/nutch solrindex http://localhost:8983/solr -reindex I don't get this error anymore. By the way, who sets jetty.port? 2013/3/24 Jan Høydahl jan@cominvent.com How have you set up Nutch to index to Solr? Are you running this over HTTP between two different servers? jetty.port is a silly name, but you can rename it anything you like. Its only task is to select which port to start an embedded ZooKeeper at if you use -DzkRun. If you don't, just forget about it. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com 23. mars 2013 kl. 14:34 skrev Furkan KAMACI furkankam...@gmail.com: Hi Jan; I will check the jar versions. By the way, I think that I should create a Solr home directory for my application (my application is this: I use Nutch to crawl web sites and Solr to index them). Which folder from the Solr source code folders (maybe lucene-solr/solr/example/example-DIH/solr?) should I copy to somewhere and pass its path as the Solr home JVM parameter? And I don't know what extra changes I should make for my situation (Nutch crawling and Solr indexing). In solr.xml there is a field ${jetty.port:} and I didn't define a port for it. I use Tomcat and it runs at 8080, and I think the Jetty port is 8983; that's why I think there may be a point of confusion. 2013/3/23 Jan Høydahl jan@cominvent.com Are you 100% sure you use the exact jars for 4.1.0 *everywhere*, and that you're not blending older versions from the Nutch distro into your classpath here? Any ideas? BTW: What was your question here regarding Jetty vs Tomcat? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com 23. mars 2013 kl. 00:50 skrev Furkan KAMACI furkankam...@gmail.com: I just indicated this JVM parameter: -Dsolr.solr.home=/home/projects/lucene-solr/solr/solr_home solr_home is where my config files etc. stand.
My solr.xml has these lines: <cores adminPath="/admin/cores" defaultCoreName="collection1" host="${host:}" hostPort="${jetty.port:}" hostContext="${hostContext:}" zkClientTimeout="${zkClientTimeout:15000}"> <core name="collection1" instanceDir="collection1"/> </cores> On the other hand, I run it from my Tomcat, without using the example's embedded Jetty start.jar. Any ideas? 2013/3/22 Furkan KAMACI furkankam...@gmail.com I use Solr 4.1.0 and Nutch 2.1, Java 1.7.0_17, Tomcat 7.0 and IntelliJ IDEA 12 on a 64-bit Centos 6.4 computer. I run this command successfully: bin/nutch solrindex http://localhost:8080/solr -index However, when I run this command: bin/nutch solrindex http://localhost:8080/solr -reindex I get this error: Mar 22, 2013 6:48:27 PM org.apache.solr.common.SolrException log SEVERE: null:java.lang.RuntimeException: java.lang.NoSuchMethodError: org.apache.lucene.index.IndexWriter.updateDocument(Lorg/apache/lucene/index/Term;Lorg/apache/lucene/index/IndexDocument;Lorg/apache/lucene/analysis/Analyzer;)V at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:653) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:366) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99) at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:936) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118) at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407) at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1004) at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589) at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722) Caused by: java.lang.NoSuchMethodError: org.apache.lucene.index.IndexWriter.updateDocument(Lorg/apache/lucene/index/Term;Lorg/apache/lucene/index/IndexDocument;Lorg
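To make the jetty.port point from this thread concrete: per Jan's note, the property only matters when embedded ZooKeeper is started with -DzkRun. A minimal setenv.sh sketch (the port value is an example assumption):

```shell
# jetty.port only has a job when -DzkRun is used: it tells Solr which port
# the servlet container listens on, so the embedded ZooKeeper port can be
# derived from it. Without -DzkRun it can be left undefined.
JAVA_OPTS="$JAVA_OPTS -DzkRun -Djetty.port=8080"
export JAVA_OPTS
```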
Re: Recommendation for integration test framework
Unrelated to your question: you said "We are utilizing Apache Maven as build management tool". I think that currently Ant + Ivy are the build and dependency management tools, and the Maven POM is generated via a plugin (correct me if I am wrong). Is there any plan to move the project to Maven? 2013/3/25 Jan Morlock jan.morl...@googlemail.com Hi, our Solr implementation consists of several cores, sometimes interacting with each other. Using SolrTestCaseJ4 didn't work out for us. Instead we would like to test the resulting war from outside using integration tests. We are utilizing Apache Maven as build management tool. Therefore we are currently thinking about using the Maven Failsafe plugin. Does anybody have experience using it in combination with Solr? Or does somebody have a better recommendation for us? Thank you very much in advance Jan -- View this message in context: http://lucene.472066.n3.nabble.com/Recommendation-for-integration-test-framework-tp4050936.html Sent from the Solr - User mailing list archive at Nabble.com.
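For context on the Failsafe approach discussed above: the plugin runs classes matching *IT.java during the integration-test phase and fails the build in verify, so the container can be started and stopped around the tests. A POM fragment sketch (the version number is an assumption):

```xml
<!-- Sketch: bind the Failsafe plugin to integration-test/verify.
     The version is an example assumption, not taken from this thread. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-failsafe-plugin</artifactId>
  <version>2.12.4</version>
  <executions>
    <execution>
      <goals>
        <goal>integration-test</goal>
        <goal>verify</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```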
Re: multicore vs multi collection
Did you check this document: http://wiki.apache.org/solr/SolrCloud#A_little_about_SolrCores_and_Collections It says: On a single instance, Solr has something called a SolrCore that is essentially a single index. If you want multiple indexes, you create multiple SolrCores. With SolrCloud, a single index can span multiple Solr instances. This means that a single index can be made up of multiple SolrCores on different machines. We call all of these SolrCores that make up one logical index a collection. A collection is essentially a single index that spans many SolrCores, both for index scaling as well as redundancy. If you wanted to move your 2-SolrCore Solr setup to SolrCloud, you would have 2 collections, each made up of multiple individual SolrCores. 2013/3/26 J Mohamed Zahoor zah...@indix.com Hi, I am kind of confused between multi-core and multi-collection. The docs don't seem to clarify this. Can someone enlighten me: what is the difference between a core and a collection? Are they the same? ./zahoor
Debugging Map Reduce Jobs at Solr
Is there any easy way (tools etc.) to debug Map Reduce jobs of Solr?
Re: Debugging Map Reduce Jobs at Solr
Ok, thanks for your responses. Actually I was wondering about indexing and reindexing from Nutch to Solr and debugging them. From your responses I understand that on the Solr side it makes no difference whether the data comes through a MapReduce job or not. 2013/3/26 Otis Gospodnetic otis.gospodne...@gmail.com Hi, Solr doesn't really do MapReduce jobs. Maybe you mean distributed search, where queries are dispatched to N servers and then responses are merged/reduced to the top N and returned? Otis -- Solr ElasticSearch Support http://sematext.com/ On Tue, Mar 26, 2013 at 6:34 AM, Furkan KAMACI furkankam...@gmail.com wrote: Is there any easy way (tools etc.) to debug Map Reduce jobs of Solr?
There are no SolrCores running. Using the Solr Admin UI currently requires at least one SolrCore.
I use Solr 4.2 on Centos 6.4 at AWS, and I have deployed the Solr wars into Tomcat on two different Amazon instances. *When I run them without SolrCloud they are OK.* However, I want to use them as a SolrCloud, and I want to start embedded ZooKeeper on one of them. When I run: ps aux | grep catalina I get this: /usr/java/default/bin/java -Djava.util.logging.config.file=/usr/share/tomcat/conf/logging.properties -Dbootstrap_confdir=/usr/share/solrhome/collection1/conf -Dcollection.configName=custom_conf -DnumShards=2 -DzkRun -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.endorsed.dirs=/usr/share/tomcat/endorsed -classpath /usr/share/tomcat/bin/bootstrap.jar:/usr/share/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/usr/share/tomcat -Dcatalina.home=/usr/share/tomcat -Djava.io.tmpdir=/usr/share/tomcat/temp org.apache.catalina.startup.Bootstrap start solrhome is my Solr home. My solr.xml has this: <cores adminPath="/admin/cores" defaultCoreName="collection1" host="${host:}" hostPort="${jetty.port:8080}" hostContext="${hostContext:search}" zkClientTimeout="${zkClientTimeout:15000}"> <core name="collection1" instanceDir="collection1" /> </cores> When I open the web page I get this error: * There are no SolrCores running. Using the Solr Admin UI currently requires at least one SolrCore.* When I look at catalina.out I see this: Mar 26, 2013 8:54:35 PM org.apache.solr.cloud.ZkController publish INFO: publishing core=collection1 state=down Mar 26, 2013 8:54:35 PM org.apache.solr.cloud.ZkController publish INFO: numShards not found on descriptor - reading it from system property Mar 26, 2013 8:54:36 PM org.apache.solr.common.cloud.ZkStateReader updateClusterState INFO: Updating cloud state from ZooKeeper... 
Mar 26, 2013 8:54:36 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater updateState INFO: Update state numShards=2 message={ operation:state, core_node_name:null, numShards:2, shard:null, roles:null, state:down, core:collection1, collection:collection1, node_name:**.**.***.**:8080_search,// I have put * as ip base_url:http://**.**.***.**:8080/search} // I have put * as ip Mar 26, 2013 8:54:36 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater createCollection INFO: Create collection collection1 with numShards 2 Mar 26, 2013 8:54:36 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater updateState INFO: Assigning new node to shard shard=shard1 Mar 26, 2013 8:54:36 PM org.apache.zookeeper.server.NIOServerCnxnFactory$1 uncaughtException SEVERE: Thread Thread[Thread-3,5,Overseer state updater.] died java.lang.NoSuchMethodError: org.apache.solr.common.cloud.SolrZkClient.setData(Ljava/lang/String;[BZ)Lorg/apache/zookeeper/data/Stat; at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:144) at java.lang.Thread.run(Thread.java:722) Mar 26, 2013 8:59:55 PM org.apache.solr.common.SolrException log SEVERE: null:org.apache.solr.common.SolrException: Could not get shard_id for core: collection1 coreNodeName:10.36.163.29:8080_search_collection1 at org.apache.solr.cloud.ZkController.doGetShardIdProcess(ZkController.java:1221) at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1290) at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:861) at org.apache.solr.core.CoreContainer.register(CoreContainer.java:841) at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:638) at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at 
java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722) Mar 26, 2013 8:59:55 PM org.apache.solr.core.SolrCore close INFO: [collection1] CLOSING SolrCore org.apache.solr.core.SolrCore@64e5472e Mar 26, 2013 8:59:55 PM org.apache.solr.update.DirectUpdateHandler2 close INFO: closing DirectUpdateHandler2{commits=0,autocommit maxTime=15000ms,autocommits=0,soft autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0} Mar 26, 2013 8:59:55 PM org.apache.solr.update.SolrCoreState decrefSolrCoreState INFO: Closing SolrCoreState Mar 26, 2013 8:59:56 PM org.apache.catalina.startup.Catalina start INFO: Server startup in 327928 ms Mar 26, 2013 8:59:57 PM org.apache.solr.servlet.SolrDispatchFilter handleAdminRequest INFO:
Re: There are no SolrCores running. Using the Solr Admin UI currently requires at least one SolrCore.
Yes, I cleaned and compiled with ant again and that fixed it, because there were some other jars in my lib somehow. How could I have understood that there was a mix of jars? Just from the NoSuchMethodError, or from something else? 2013/3/26 Mark Miller markrmil...@gmail.com java.lang.NoSuchMethodError: There must be something off with the jars you are using - a mix of versions or something. - Mark On Mar 26, 2013, at 5:18 PM, Furkan KAMACI furkankam...@gmail.com wrote: [original message quoted in full above, trimmed]
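One quick way to answer that question before a NoSuchMethodError appears is to list the version numbers of the Lucene/Solr jars actually on the classpath. A sketch; the example directory path is an assumption for a Tomcat deployment:

```shell
# jar_versions: print the distinct version numbers of lucene-*/solr-* jars
# found in a lib directory. More than one line of output means mixed versions.
jar_versions() {
  ls "$1" | grep -E '^(lucene|solr)-' \
    | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | sort -u
}
# Example (the path is an assumption for a Tomcat deployment):
# jar_versions /usr/share/tomcat/webapps/solr/WEB-INF/lib
```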
SolrCloud On Different AWS Instances With Embedded Zookeeper
I have two Amazon Web Services instances and I have set up SolrCloud on them. The Solr wars are deployed into Tomcat. When I start the Solr that runs ZooKeeper, it is OK; it cannot find the second shard, as expected. When I start up the second Solr, it throws an error. This is the first Solr's config: JAVA_OPTS="$JAVA_OPTS -Dbootstrap_confdir=/usr/share/solr_home/collection1/conf -Dcollection.configName=custom_conf -DnumShards=2 -DzkRun" This is for the second one: JAVA_OPTS="$JAVA_OPTS -DzkHost=**.**.***.**:9080" // I have masked the IP This is the error I get in catalina.out: Mar 26, 2013 10:42:14 PM org.apache.zookeeper.ClientCnxn$SendThread logStartConnect INFO: Opening socket connection to server ip-**-**-***-**.eu-west-1.compute.internal/**.**.***.**:9080. Will not attempt to authenticate using SASL (unknown error) Mar 26, 2013 10:42:14 PM org.apache.zookeeper.ClientCnxn$SendThread run WARNING: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:692) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068) And the things I am confused about: 1) I didn't define -DzkHost for the first one. 2) The first Solr runs on 8080, and I added +1000 and set 9080 for the second one's -DzkHost. I should learn whether I need to define that 8080 on the first one anywhere else. Any ideas?
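A sketch of the two-node arrangement described above (hostnames, paths and ports are assumptions). The embedded ZooKeeper started by -DzkRun listens at the Solr port plus 1000, which is why 9080 pairs with a Solr running on 8080:

```shell
# Node 1: bootstraps the config and runs embedded ZooKeeper.
# With Solr on port 8080, embedded ZooKeeper listens on 8080 + 1000 = 9080.
JAVA_OPTS="$JAVA_OPTS -Dbootstrap_confdir=/usr/share/solr_home/collection1/conf \
 -Dcollection.configName=custom_conf -DnumShards=2 -DzkRun -Djetty.port=8080"

# Node 2: no -DzkRun; it only points at node 1's embedded ZooKeeper.
# JAVA_OPTS="$JAVA_OPTS -DzkHost=node1.example.com:9080"
export JAVA_OPTS
```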
Re: multicore vs multi collection
Also from there http://wiki.apache.org/solr/SolrCloud: *Q:* What is the difference between a Collection and a SolrCore? *A:* In classic single-node Solr, a SolrCore is basically equivalent to a Collection. It presents one logical index. In SolrCloud, the SolrCores on multiple nodes form a Collection. This is still just one logical index, but multiple SolrCores host different 'shards' of the full collection. So a SolrCore encapsulates a single physical index on an instance. A Collection is a combination of all of the SolrCores that together provide a logical index that is distributed across many nodes. 2013/3/26 J Mohamed Zahoor zah...@indix.com Thanks. This makes it clearer than the wiki. How do you create multiple collections which can have different schemas? ./zahoor On 26-Mar-2013, at 3:52 PM, Furkan KAMACI furkankam...@gmail.com wrote: [earlier message quoted above, trimmed]
Re: Loadtesting solr/tomcat7 and tomcat stops responding entirely
Hi Nate; this may be off topic, but could you explain why you want to use Tomcat instead of Jetty or embedded Jetty? 2013/3/27 Michael Della Bitta michael.della.bi...@appinions.com You're using the blocking IO connector, which isn't so great for heavy loads. Give this a shot... You'll end up with 8192 max connections by default, although this is tunable too: Run: apt-get install libapr1 libtcnative-1 Add this to the list of Listeners at the top of server.xml: <Listener className="org.apache.catalina.core.AprLifecycleListener" SSLEngine="off" /> These instructions assume you're running Tomcat 6 or 7. Here's some documentation: http://tomcat.apache.org/tomcat-7.0-doc/apr.html http://tomcat.apache.org/tomcat-7.0-doc/config/http.html Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn't a Game On Tue, Mar 26, 2013 at 5:31 PM, Nate Fox n...@neogov.com wrote: We're not using ELB and I have no idea which connector I'm using - I'm guessing whatever is the default (I'm a total noob). This is from my server.xml: <Connector port="8080" protocol="HTTP/1.1" connectionTimeout="6" URIEncoding="UTF-8" redirectPort="8443" /> -- Nate Fox Sr Systems Engineer o: 310.658.5775 m: 714.248.5350 Follow us @NEOGOV http://twitter.com/NEOGOV and on Facebook http://www.facebook.com/neogov NEOGOV http://www.neogov.com/ is among the top fastest growing software companies in the USA, recognized by Inc 500|5000, Deloitte Fast 500, and the LA Business Journal. We are hiring! http://www.neogov.com/#/company/careers On Tue, Mar 26, 2013 at 1:02 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Nate, We just cleared up a problem similar to this by ditching Elastic Load Balancer and switching over to the APR connector in Tomcat. Are you using either of those? 
On Tue, Mar 26, 2013 at 2:58 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi Nate, Try adding some warmup queries and making sure the setting for using a cold searcher in solrconfig.xml is set to false. Your warmup queries should use facets and sorting if your normal queries use them. In SPM you'll actually see how much time warming up takes, so you'll get a better idea of the cost of that (when you don't do it). Otis -- Solr ElasticSearch Support http://sematext.com/ On Tue, Mar 26, 2013 at 2:50 PM, Nate Fox n...@neogov.com wrote: I was wondering if the warmup stuff was one of the culprits (we don't have warmups at all - the configs are pretty stock). As for the system, it seems capable of quite a bit more: memory usage is ~30%, JVM memory (from the dashboard) is very low (~220Mb out of 3Gb) and load is below 1.00. The seed data and queries were put together by one of our developers. I've put all the solrmeter files here: https://gist.github.com/natefox/ee5cef3d4fbbc73e9bce Unfortunately I'm quite new to Solr (and Tomcat) so I'm not entirely sure which file does what specifically. Does the system's reaction to a 'fast load' without a warmup sound normal? I would have expected the first couple hundred queries to be very slow (500ms) and then the system to catch up after a while. But it just dies very quickly and never recovers. I'll check out your SPM - I've seen it mentioned before. Thanks! On Tue, Mar 26, 2013 at 11:12 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, In short, certain data structures need to load from the index in the beginning (for sorting and faceting), caches need to warm up, the JVM needs to warm up, etc., so going slowly in the beginning makes sense. Why things die after that is a different question. Maybe it OOMs? Maybe queries are very complex? What do your queries look like? I see newrelic.jar on the command line. You may want to try SPM for Solr; it has better Solr metrics. Otis -- Solr ElasticSearch Support http://sematext.com/ On Tue, Mar 26, 2013 at 1:24 PM, Nate Fox n...@neogov.com wrote: I'm new to solr and I'm load testing our setup to see what we can handle.
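Otis's advice about warmup queries and the cold-searcher setting maps to elements inside the <query> section of solrconfig.xml. A sketch; the query values are placeholder assumptions, not from this thread:

```xml
<!-- Sketch of warmup configuration inside solrconfig.xml's <query> section.
     The query, sort and facet values below are example assumptions. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">some popular query</str>
      <str name="sort">price asc</str>
      <str name="facet">true</str>
      <str name="facet.field">category</str>
    </lst>
  </arr>
</listener>
<!-- Do not serve requests from an unwarmed searcher: -->
<useColdSearcher>false</useColdSearcher>
```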
Re: Setup solrcloud on tomcat
First of all, check your catalina.out log; it gives detail about what is wrong. Secondly, you can separate such JVM parameters from solr.xml and put them into a setenv.sh file (create it under Tomcat's bin folder). Here is what you should do: #!/bin/sh JAVA_OPTS="$JAVA_OPTS -Dbootstrap_confdir=/usr/share/solrhome/collection1/conf -Dcollection.configName=custom_conf -DnumShards=2 -DzkRun" export JAVA_OPTS You should change /usr/share/solrhome to wherever your Solr home is. That should start up an embedded ZooKeeper. On the other hand, a client that will connect to the embedded ZooKeeper should have this setenv.sh: #!/bin/sh JAVA_OPTS="$JAVA_OPTS -DzkHost=**.**.***.**:2181" export JAVA_OPTS I have masked the IP address; you should put yours. 2013/3/28 하정대 jungdae...@ahnlab.com Hi, all. I tried to set up SolrCloud on Tomcat, but I couldn't see the Cloud section in the Solr menu. I think the embedded ZooKeeper might not be loaded. This is my solr.xml file that was supposed to run ZooKeeper: <solr persistent="true"> <cores adminPath="/admin/cores" defaultCoreName="collection1" host="${host:}" hostPort="8080" hostContext="${hostContext:}" numShards="2" zkRun="http://localhost:9081" zkClientTimeout="${zkClientTimeout:15000}"> <core name="collection1" instanceDir="collection1" /> </cores> </solr> What shall I do? I need your help. Also, an example file or tutorial would be a good help for me. I am working through the SolrCloud wiki. Thanks, all. "The safest name in the world - AhnLab" 하정대 (Jungdae Ha), Senior Researcher / ASD Team Tel: 031-722-8338 e-mail: jungdae...@ahnlab.com http://www.ahnlab.com 673 Sampyeong-dong, Bundang-gu, Seongnam-si, Gyeonggi-do 463-400
Combining Solr Indexes at SolrCloud
Let's assume that I have two machines in a SolrCloud that work as part of the cloud. If I want to shut down one of them and combine its indexes into the other, how can I do that?
SOAP for Solr indexing mechanism
Is there any support for communicating with Solr's indexing mechanism over SOAP?
Parallel Indexing With Solr?
Does Solr allow parallelism (parallel computing) for indexing?
Suggestions for Customizing Solr Admin Page
I want to customize the Solr Admin Page. I think I will need more complicated things to manage my cloud. I will separate my Solr cluster into indexing-only nodes and response-only nodes. I will index my documents by category and index them into different collections. In my admin page I will combine those collections, and separate a collection into new ones. I will add, remove, and query documents, etc. Here is an old topic about the Solr admin page: http://lucene.472066.n3.nabble.com/Extending-Solr-s-Admin-functionality-td473974.html My needs may change, and some of them could be served via the existing Solr admin page. What do you suggest: extending the existing admin page, or wrapping up a new one over SolrJ? Which directions should I consider, and how can I decide between them?
Re: Parallel Indexing With Solr?
Can you tell me more about "You can index from a MapReduce job"? I use Nutch, and it tells Solr to index and reindex. I know that I can use MapReduce jobs on the Nutch side, but can I use MapReduce jobs on the Solr side (i.e. for indexing etc.)? 2013/3/29 Otis Gospodnetic otis.gospodne...@gmail.com Yes. You can index from any app that can hit Solr with multiple threads. You can use StreamingUpdateSolrServer, at least in older Solrs, to handle multi-threading for you. You can index from a MapReduce job. Otis -- Solr ElasticSearch Support http://sematext.com/ On Fri, Mar 29, 2013 at 5:26 AM, Furkan KAMACI furkankam...@gmail.com wrote: Does Solr allow parallelism (parallel computing) for indexing?
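The multi-threaded-client pattern Otis describes can be sketched from the command line as well: split the documents into batches and post them concurrently. In this sketch the HTTP POST is stubbed out with an echo so it runs without a server; the commented curl line and its URL are assumptions:

```shell
# Client-side parallel indexing sketch: post document batches concurrently.
# 'echo' stands in for the real HTTP POST; the curl URL below is an assumption.
index_batches() {
  dir="$1"; workers="$2"
  ls "$dir" | xargs -P "$workers" -I {} sh -c 'echo "would POST $1"' _ {}
  # A real run would replace the echo with something like:
  #   curl -s "http://localhost:8080/solr/update" \
  #        -H "Content-Type: text/xml" --data-binary @"$1"
}
# Example: index_batches ./batches 4
```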
Filtering Search Cloud
I want to separate my cloud into two logical parts. One of them is the indexer part of the SolrCloud; the second is the searcher part. My first question: does separating my cloud system this way make sense as a performance improvement? I think that indexing takes time away from search responses, so if I separate them I get a performance improvement. On the other hand, maybe using all Solr machines as a whole (I mean not partitioning as I mentioned) lets SolrCloud do better load balancing; I would like to learn that. My second question: let's assume that I have separated my machines as I mentioned. Can I filter which indexes are searchable or not from the searcher SolrCloud?
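On the second question: a distributed request can be pointed at an explicit subset of shards with the shards parameter, which is one way to keep certain indexes out of a search. A request sketch; the host names and paths are assumptions:

```shell
# Restrict a distributed query to chosen shards only
# (host names and core paths are example assumptions).
curl "http://searcher1:8080/solr/collection1/select?q=*:*&shards=searcher1:8080/solr/collection1,searcher2:8080/solr/collection1"
```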
Re: Flow Chart of Solr
Actually, maybe one of the most important core things is the Analysis part in the last diagram, but there is nothing about it, i.e. stemming, lemmatizing etc., in any of them. 2013/4/2 Andre Bois-Crettez andre.b...@kelkoo.com On 04/02/2013 04:20 PM, Koji Sekiguchi wrote: (13/04/02 21:45), Furkan KAMACI wrote: Is there any documentation, something like a flow chart of Solr? I.e. documents come into Solr (maybe indicating which classes get documents), go through the parsing process (i.e. stemming processes etc.), and then inverted indexes are built, and so forth? There is an interesting ticket: Architecture Diagrams needed for Lucene, Solr and Nutch https://issues.apache.org/jira/browse/LUCENE-2412 koji I like this one, it is a bit more detailed: http://www.cominvent.com/2011/04/04/solr-architecture-diagram/ -- André Bois-Crettez Search technology, Kelkoo http://www.kelkoo.com/ Kelkoo SAS Société par Actions Simplifiée Au capital de € 4.168.964,30 Siège social : 8, rue du Sentier 75002 Paris 425 093 069 RCS Paris [This message and its attachments are confidential and intended exclusively for their addressees. If you are not the intended recipient, please delete it and notify the sender.]
Re: Flow Chart of Solr
You are right about mentioning developer doc and user doc. Users separate about it. Some of them uses Solr for indexing and monitoring via admin face and that is quietly enough for them however some people wants to modify it so it would be nice if there had been some documentation for developer side too. 2013/4/2 Yago Riveiro yago.rive...@gmail.com For beginners is complicate understand the complexity of solr / lucene, I'm trying devel a custom search component and it's too hard keep in mind the flow, inheritance and iteration between classes. I think that there is a gap between software doc and user doc, or maybe I don't search enough T_T. Java doc not always is clear always. The fact that I'm beginner in solr world don't help. Either way, this thread was very helpful, I found some very good resources here :) Cumprimentos -- Yago Riveiro Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Tuesday, April 2, 2013 at 3:51 PM, Furkan KAMACI wrote: Actually maybe one the most important core thing is that Analysis part at last diagram but there is nothing about it i.e. stamming, lemmitazing etc. at any of them. 2013/4/2 Andre Bois-Crettez andre.b...@kelkoo.com (mailto: andre.b...@kelkoo.com) On 04/02/2013 04:20 PM, Koji Sekiguchi wrote: (13/04/02 21:45), Furkan KAMACI wrote: Is there any documentation something like flow chart of Solr. i.e. Documents comes into Solr(maybe indicating which classes get documents) and goes to parsing process (i.e. stemming processes etc.) and then reverse indexes are get so on so forth? 
There is an interesting ticket: Architecture Diagrams needed for Lucene, Solr and Nutch https://issues.apache.org/jira/browse/LUCENE-2412 koji I like this one, it is a bit more detailed: http://www.cominvent.com/2011/04/04/solr-architecture-diagram/ -- André Bois-Crettez Search technology, Kelkoo http://www.kelkoo.com/
Re: [ANNOUNCE] Solr wiki editing change
Hi; Please add FurkanKAMACI to the group. Thanks; Furkan KAMACI 2013/4/2 Steve Rowe sar...@gmail.com On Apr 2, 2013, at 11:23 AM, Ryan Ernst r...@iernst.net wrote: Please add RyanErnst to the contributors group. Thanks! Added to solr wiki ContributorsGroup.
Re: Flow Chart of Solr
I think about myself as an example. I started researching Solr just a few weeks ago and have learned Solr and its related projects. My next step is writing down the main steps of Solr. We have separated the learning curve of Solr into two main categories. The first is for people who use it as an out-of-the-box component. The second is the developer side. The developer side branches into two paths. The first covers the general steps: i.e. a document comes into Solr (e.g. crawled data from Nutch), which analysis processes are applied (stemming etc.), and what happens after parsing, step by step. When a search query arrives, what happens step by step, and at which step scores are calculated, and so forth. The second is more code-specific: which handlers accept the data to be indexed (no need to explain every handler at this step), which analyzer and tokenizer classes are involved and what the flow is between them, and how response handlers work and what they are. Explaining the cloud side is yet another task. Some explanations are currently present in the wiki (but some are in very deep places, and it is not easy to find the parent topic; maybe starting the wiki from a top page and branching all other topics from it would be better). If we could show the big picture, and beside it the smaller pictures within it, it would be great (if you know the main parts, it is easy to go deep into the code, i.e. you don't need to explain every handler; if you show the way to the developer, he/she can debug and find what is needed). When I think about myself as an example: I have had to write down the steps of Solr in some detail, and even after reading many wiki pages and a book about it, I see that it is not easy to write down even the big picture of the developer side. 2013/4/2 Alexandre Rafalovitch arafa...@gmail.com Yago, My point - perhaps lost in too much text - was that Solr is presented - and can function - as a black-box.
Which makes it different from more traditional open-source projects. So, stage 2 happens exactly when the non-programmers have to cross the boundary from the black box into a code-first approach, and the hand-off is not particularly smooth. Or even when - say - a PHP or .NET programmer tries to get beyond the basic operations of their client library and has to understand the server-side aspects of Solr. Regards, Alex. On Tue, Apr 2, 2013 at 1:19 PM, Yago Riveiro yago.rive...@gmail.com wrote: Alexandre, You describe the normal path when a beginner tries to use a source of code they don't understand: black box, reading code, hacking, ok now I know 10% of the project, with luck :p. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: Flow Chart of Solr
So, all in all, is there anybody who can write down just the main steps of Solr (including parsing, stemming etc.)? 2013/4/2 Furkan KAMACI furkankam...@gmail.com I think about myself as an example. I started researching Solr just a few weeks ago and have learned Solr and its related projects. My next step is writing down the main steps of Solr. [...] 2013/4/2 Alexandre Rafalovitch arafa...@gmail.com Yago, My point - perhaps lost in too much text - was that Solr is presented - and can function - as a black-box. [...]
Re: Filtering Search Cloud
Shawn, thanks for your detailed explanation. My system will run under high load: I will always be indexing something, and something will always be queried. That is why I am considering physically separating the indexer and query-serving machines. Think of it this way: imagine a machine that both does indexing (one kind of disk IO; I don't know the underlying system, maybe Solr does sequential IO) and tries to answer queries (another kind of IO). That is my main reason for considering separating them. And the next step: if I separate them, can I filter the data of the indexer machines before it is served? (I don't have any filtering needs right now; I just think I may need this in the future.) 2013/4/3 Shawn Heisey s...@elyograg.org On 4/1/2013 3:02 PM, Furkan KAMACI wrote: I want to separate my cloud into two logical parts. One of them is the indexer part of SolrCloud; the second is the searcher part. My first question: does separating my cloud this way make sense for performance? I think that while indexing, searches take longer to respond, and if I separate them I get a performance improvement. On the other hand, maybe by using all Solr machines as a whole (I mean not partitioning as I mentioned), SolrCloud can do better load balancing; I would like to learn this. My second question: let's assume I have separated my machines as mentioned. Can I filter some indexes to be searchable or not from the searcher SolrCloud? SolrCloud gets rid of the master and slave designations. It also gets rid of the line between indexing and querying. Each shard has a replica that is designated the leader, but that has no real impact on searching and indexing, only on deciding which data to use when replicas get out of sync. In the old master-slave architecture, you indexed to the master and the updated index files were replicated to the slave.
The slave did not handle the analysis for indexing, so it was usually better to send queries to slaves and let the master only do indexing. SolrCloud is very different. When you index, the documents are indexed on all replicas at about the same time. When you query, the requests are load balanced across all replicas. During normal operation, SolrCloud does not use replication at all. The replication feature is only used when a replica gets out of sync with the leader, and in that case, the entire index is replicated. Thanks, Shawn
Re: Filtering Search Cloud
Thanks for your explanation, you explained everything I need. Just one more question. I see that I cannot do it with SolrCloud, but I can do something like it with Solr's master-slave replication. If I use master-slave replication, can I eliminate (filter) something that was indexed on the master from the responses when querying the slaves? 2013/4/3 Shawn Heisey s...@elyograg.org On 4/3/2013 1:13 PM, Furkan KAMACI wrote: Shawn, thanks for your detailed explanation. [...] We do seem to have a language barrier, so let me try to be very clear: If you use SolrCloud, you can't separate querying and indexing. You will have to use the master-slave replication that has been part of Solr since at least 1.4, possibly earlier. Thanks, Shawn
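The master-slave setup Shawn recommends is configured through the ReplicationHandler in solrconfig.xml. A minimal sketch, with the hostname, conf file list, and poll interval as placeholder values:

```xml
<!-- On the master: publish the index after each commit -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- On each slave: poll the master for new index generations -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/collection1</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```

Note that replication copies index files verbatim, so filtering individual documents out of slave responses would have to happen at query time (e.g. with an fq filter on some marker field), not during replication.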
Difference Between Indexing and Reindexing
OK, this may be an easy question, but I want to learn a bit more of the technical detail. When I use Nutch to send documents to Solr to be indexed, there are two parameters: -index and -reindex. What does Solr do differently for each one?
Re: Difference Between Indexing and Reindexing
Hi Otis, then what is the difference between add and update? And how do we update or add documents into Solr (I see that there is just one update handler)? 2013/4/4 Otis Gospodnetic otis.gospodne...@gmail.com I don't recall what Nutch does, so it's hard to tell. In Solr (Lucene, really), you can: * add documents * update documents * delete documents Currently, update is really a delete + re-add under the hood. It's been like that for 13+ years, but this may change: https://issues.apache.org/jira/browse/LUCENE-4258 Otis -- Solr ElasticSearch Support http://sematext.com/ On Wed, Apr 3, 2013 at 9:15 PM, Furkan KAMACI furkankam...@gmail.com wrote: OK, this may be an easy question, but I want to learn a bit more of the technical detail. When I use Nutch to send documents to Solr to be indexed, there are two parameters: -index and -reindex. What does Solr do differently for each one?
Re: Difference Between Indexing and Reindexing
I crawl webpages with Nutch and send them to Solr for indexing. There are two parameters for sending data into Solr: one is -index and the other is -reindex. I just want to learn what they do. 2013/4/4 Jack Krupansky j...@basetechnology.com Technically, update and add are identical from a user perspective - you don't need to worry about whether the document already exists. But there is another, newer form of update, selective or atomic, which updates a subset of the fields in an existing document without needing to re-send all of the other fields of the existing document. See: http://wiki.apache.org/solr/Atomic_Updates But... none of this has to do with indexing vs. reindexing... you need to be clear what real question you are trying to ask; otherwise we can keep following your questions, answering each in detail, bouncing all over the place without understanding what it is that you are really looking for. More specifically, what exactly is the problem you are trying to solve? -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Thursday, April 04, 2013 2:45 AM To: solr-user@lucene.apache.org Subject: Re: Difference Between Indexing and Reindexing Hi Otis, then what is the difference between add and update? And how do we update or add documents into Solr (I see that there is just one update handler)? 2013/4/4 Otis Gospodnetic otis.gospodne...@gmail.com I don't recall what Nutch does, so it's hard to tell. In Solr (Lucene, really), you can: * add documents * update documents * delete documents Currently, update is really a delete + re-add under the hood.
It's been like that for 13+ years, but this may change: https://issues.apache.org/jira/browse/LUCENE-4258 Otis -- Solr ElasticSearch Support http://sematext.com/ On Wed, Apr 3, 2013 at 9:15 PM, Furkan KAMACI furkankam...@gmail.com wrote: OK, this may be an easy question, but I want to learn a bit more of the technical detail. When I use Nutch to send documents to Solr to be indexed, there are two parameters: -index and -reindex. What does Solr do differently for each one?
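The difference Jack describes shows up in the JSON sent to the /update handler. Both forms go to the same handler; the field names and values below are illustrative. A full add/update re-sends the whole document:

```json
[
  {"id": "doc1", "title": "Full document: every field is (re)sent", "price": 10.0}
]
```

An atomic update names only the fields to change, using a modifier such as set (Solr fetches the stored copies of the other fields internally):

```json
[
  {"id": "doc1", "price": {"set": 9.99}}
]
```

Atomic updates require the other fields to be stored, since Solr reconstructs the document from its stored fields before re-indexing it.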
Re: Difference Between Indexing and Reindexing
I use Nutch 2.1 with these commands: bin/nutch solrindex http://localhost:8983/solr -index bin/nutch solrindex http://localhost:8983/solr -reindex 2013/4/4 Gora Mohanty g...@mimirtech.com On 4 April 2013 18:33, Furkan KAMACI furkankam...@gmail.com wrote: I crawl webpages with Nutch and send them to Solr for indexing. There are two parameters for sending data into Solr: one is -index and the other is -reindex. I just want to learn what they do. [...] Which version of Nutch are you using? Unless I have completely missed something, both 1.6 and 2.1 use solrindex: http://wiki.apache.org/nutch/NutchTutorial#A6._Integrate_Solr_with_Nutch Where do you see -index and -reindex? Regards, Gora
Re: Difference Between Indexing and Reindexing
It may be deprecated usage (maybe not), but I can certainly run -index and -reindex on Nutch 2.1. 2013/4/4 Gora Mohanty g...@mimirtech.com On 4 April 2013 20:16, Gora Mohanty g...@mimirtech.com wrote: On 4 April 2013 19:29, Furkan KAMACI furkankam...@gmail.com wrote: I use Nutch 2.1 with these commands: bin/nutch solrindex http://localhost:8983/solr -index bin/nutch solrindex http://localhost:8983/solr -reindex [...] Sorry, but are you sure that you are using 2.1? Here is what I get with: ./bin/nutch solrindex [...] I am running in local mode, however, as I do not currently have access to a Hadoop cluster. Regards, Gora
Re: Filtering Search Cloud
Ok, I will test and give you a detailed report, thanks for your help. 2013/4/5 Erick Erickson erickerick...@gmail.com I cannot emphasize strongly enough that you need to _prove_ you have a problem before you decide on a solution! Do you have any evidence that SolrCloud can't handle the load you intend? Might a better approach be just to create more shards, thus spreading the load, and get all the HA/DR goodness of SolrCloud? So far you've said you'll have a heavy load without giving us any numbers. 10,000 updates/second? 10 updates/second? 1 query/second? 100,000 queries/second? 100,000 documents? 1,000,000,000,000 documents? Best Erick On Wed, Apr 3, 2013 at 5:15 PM, Shawn Heisey s...@elyograg.org wrote: On 4/3/2013 1:52 PM, Furkan KAMACI wrote: Thanks for your explanation, you explained everything I need. [...] I don't understand the question. I will attempt to give you more information, but it might not answer your question. If not, you'll have to try to improve your question. Your master and each of that master's slaves will have the same index as soon as replication is done. A query on the slave has no idea that the master exists. Thanks, Shawn
Re: Flow Chart of Solr
of the package and class names are OBVIOUS, really, and follow the class hierarchy and code flow using the standard features of any modern Java IDE. If you are wondering where to start for some specific user-level feature, please ask specifically about that feature. But... make a diligent effort to discover and learn on your own before asking open-ended questions. Sure, there are lots of things in Lucene and Solr that are rather complex and seemingly convoluted, and not obvious, but people are more than willing to help you out if you simply ask a specific question. I mean, not everybody needs to know the fine detail of query parsing, analysis, building a Lucene-level stemmer, etc. If we tried to put all of that in a diagram, most people would be more confused than enlightened. At which step are scores calculated? That's more of a Lucene question. Or, are you really asking what code in Solr invokes Lucene search methods that calculate basic scores? In short, you need to be more specific. Don't force us to guess what problem you are trying to solve. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Wednesday, April 03, 2013 6:52 AM To: solr-user@lucene.apache.org Subject: Re: Flow Chart of Solr So, all in all, is there anybody who can write down just the main steps of Solr (including parsing, stemming etc.)? [...]
Re: Sharing index amongst multiple nodes
Hi Daire Mac Mathúna; If there is a way to copy one Solr instance's indexes into another Solr instance, this may also solve the problem. Somebody generates indexes, and some other instances get a copy of them. During the synchronization process you could eliminate some of the indexes on the reader instance, so you can filter something to make it unsearchable. *This may not be efficient or a good idea, and may be solved by built-in functionality somehow.* However, I think somebody may need that mechanism. 2013/4/6 Amit Nithian anith...@gmail.com I don't understand why this would be more performant... seems like it'd be more memory and resource intensive, as you'd have multiple class-loaders and multiple cache spaces for no good reason. Just have a single core with sufficiently large caches to handle your response needs. If you want to load balance reads, consider having multiple physical nodes with master/slaves or SolrCloud. On Sat, Apr 6, 2013 at 9:21 AM, Daire Mac Mathúna daire...@gmail.com wrote: Hi. What are the thoughts on having multiple SOLR instances, i.e. multiple SOLR war files, sharing the same index (i.e. sharing the same solr_home), where only one SOLR instance is used for writing and the others for reading? Is this possible? Is it beneficial - is it more performant than having just one Solr instance? How does it affect auto-commits, i.e. how would the read nodes know the index has been changed and re-populate caches etc.? Solr 3.6.1 Thanks.
Pointing to HBase for Documents or Directly Saving Documents in HBase
Hi; First of all, I should mention that I am new to Solr and doing research on it. What I am trying to do: I will crawl some websites with Nutch and then index them with Solr (Nutch 2.1, Solr/SolrCloud 4.2). I wonder about something. I have a cloud of machines that crawls websites and stores the documents. Then I send those documents into SolrCloud. Solr indexes the documents, generates indexes, and saves them. I know from information retrieval theory that it *may* not be efficient to store indexes in a NoSQL database (they are something like linked lists, and if you store them in such a database you *may* get a sparse representation - by the way, there may be solutions for this; if you can explain them, you are welcome to). However, Solr also stores some documents (i.e. for highlighting), so some of my documents will be duplicated. Considering that I will have many documents, those duplicated documents may become a problem for me. So is there any way to avoid storing those documents in Solr and instead point to them in HBase (where I save my crawled documents), or to store them directly in HBase (is that efficient or not)?
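One common pattern for the situation described above: index the page content in Solr but do not store it, keeping only a stored key that points back to the external store (HBase here). A schema.xml sketch, with illustrative field names:

```xml
<!-- Stored key used to fetch the full document from HBase at display time -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<!-- Page text is searchable but not stored, so it is not duplicated in Solr -->
<field name="content" type="text_general" indexed="true" stored="false"/>
```

The trade-off is that Solr features that need the original text at query time, such as highlighting, only work on stored fields, so with this layout snippets would have to be built by the application from the HBase copy.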
Re: Sharing index amongst multiple nodes
Hi Walter; I am new to Solr and digging into the code to understand it. I think that when the indexer copies indexes, they are unsearchable before the commit. Where exactly in the code does that commit occur, and can I say: roll back, because I don't want those indexes? (The reason could be anything; maybe I will decline some indexes (index filtering) because of the documents they point to.) Is that possible? 2013/4/7 Walter Underwood wun...@wunderwood.org This is precisely how Solr replication works. It copies the indexes then does a commit. wunder On Apr 6, 2013, at 2:40 PM, Furkan KAMACI wrote: Hi Daire Mac Mathúna; If there is a way to copy one Solr instance's indexes into another Solr instance, this may also solve the problem. [...] -- Walter Underwood wun...@wunderwood.org
Re: Sharing index amongst multiple nodes
Hi Walter; Thanks for your explanation. You said indexing happens on one Solr server. Is that true even for SolrCloud? 2013/4/7 Walter Underwood wun...@wunderwood.org Indexing happens on one Solr server. After a commit, the documents are searchable. In Solr 4, there is a soft commit, which makes the documents searchable but does not create on-disk indexes. Solr replication copies the committed indexes to another Solr server. Solr Cloud uses a transaction log to make documents available before a hard commit. Solr does not have rollback. A commit succeeds or fails. After it succeeds, there is no going back. wunder On Apr 6, 2013, at 3:08 PM, Furkan KAMACI wrote: Hi Walter; I am new to Solr and digging into the code to understand it. [...] -- Walter Underwood wun...@wunderwood.org
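The hard/soft commit behaviour Walter describes is usually driven by solrconfig.xml rather than by explicit commit calls. A sketch with illustrative intervals (one-minute hard commits for durability, one-second soft commits for visibility):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flush segments to disk, but don't open a new searcher -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit: make documents searchable without writing new on-disk segments -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

With openSearcher=false on the hard commit, it is the soft commit alone that controls when newly indexed documents become visible to searches.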
Re: Sharing index amongst multiple nodes
My last questions: 1) If I send a document to a replica, does it pass the document to the shard leader? And do you mean that even if I send a document to the shard leader, it can pass that document to the replicas to be indexed? 2) Is it possible to copy a shard into another shard, or merge them? By the way, thanks for your explanations. 2013/4/7 Walter Underwood wun...@wunderwood.org In Solr Cloud, a document is indexed on the shard leader. The replicas in that shard get the document and add it to their indexes. There is some indexing that happens on the replicas, but that is managed by Solr. wunder On Apr 6, 2013, at 3:58 PM, Furkan KAMACI wrote: Hi Walter; Thanks for your explanation. You said indexing happens on one Solr server. Is that true even for SolrCloud? [...] -- Walter Underwood wun...@wunderwood.org
Prediction About Index Sizes of Solr
This may not be a well-detailed question, but I will try to make it clear. I am crawling web pages and will index them with SolrCloud 4.2. What I want to predict is the index size. I will have approximately 2 billion web pages, and I assume each of them will be 100 KB. I know that it depends on stored documents, stop words, etc. If you want to ask about details of my question, I can give you more explanation. However, there should be some analysis to help me, because I need to predict what the index size will be. On the other hand, my other important question is how SolrCloud makes replicas of indexes, and can I change how many replicas there will be? Because I should multiply the total index size by the number of replicas. Here is an article I found related to my analysis: http://juanggrande.wordpress.com/2010/12/20/solr-index-size-analysis/ I know this question may not be detailed, but any ideas about it are welcome.
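To give the question above some shape, here is a minimal back-of-the-envelope sketch. The 2 billion pages and 100 KB per page come from the question; the index-to-raw ratio and the replication factor of 3 are illustrative assumptions only (real ratios depend heavily on stored fields, analyzers, and stop words, as the thread notes):

```python
# Rough index-size estimate. index_to_raw_ratio is an ASSUMPTION
# (a non-stored text index is often a fraction of the raw size;
# storing full documents pushes it toward 1.0 or beyond).

def estimate_index_size_tb(num_docs, avg_doc_kb, index_to_raw_ratio,
                           replication_factor=1):
    raw_tb = num_docs * avg_doc_kb / 1024 ** 3  # KB -> TB
    return raw_tb * index_to_raw_ratio * replication_factor

raw = estimate_index_size_tb(2_000_000_000, 100, 1.0)    # raw crawl data
index = estimate_index_size_tb(2_000_000_000, 100, 0.25) # assumed 25% ratio
total = estimate_index_size_tb(2_000_000_000, 100, 0.25,
                               replication_factor=3)     # assumed 3 replicas

print(f"raw: {raw:.0f} TB, index: {index:.0f} TB, with replicas: {total:.0f} TB")
```

Whatever ratio you assume, the replica count multiplies the total linearly, which is why it matters for the capacity question asked here.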
Solr Admin Page Master Size
When I check my Solr Admin Page under Replication (Master) I see: Version 1365458125729, Gen 5, Size 18.24 MB. It is one shard on one computer. What is that 18.24 MB? Does it contain just the indexes, or the indexes plus highlights etc.? My Solr home folder was 512.7 KB and it has become 22860 KB, which is why I ask this question.
Average Solr Server Spec.
This question may not have a general answer and may be open-ended, but is there any commodity server spec for a typical machine running Solr? I mean, what is the average server specification for a Solr machine? (For example, for a system running Hadoop it is not recommended to have machines with very large storage capacity.) I will use Solr for indexing web-crawled data.
Re: How can I set configuration options?
Hi Edd; The parameters you mentioned are JVM parameters. There are two ways to define them. The first is, if you are using an IDE, you can set them as JVM parameters; e.g. in IntelliJ IDEA, when you open your Run/Debug configurations there is a field called VM Options. You can write your parameters there without the java word in front of them. The second is deploying your war file into Tomcat without using an IDE (I think this is what you want). Here is what to do: go to the Tomcat home folder, and under the bin folder create a file called setenv.sh. Then add these lines (note the quotes around the JAVA_OPTS value): #!/bin/sh export JAVA_OPTS="$JAVA_OPTS -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DzkHost=localhost:9983,localhost:8574,localhost:9900 -DnumShards=2"

2013/4/9 Edd Grant e...@eddgrant.com Hi all, I have been working through the examples on the SolrCloud page: http://wiki.apache.org/solr/SolrCloud I am now at the point where, rather than firing up Solr through start.jar, I'm deploying the Solr war into Tomcat instances. Taking the following command as an example: java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DzkHost=localhost:9983,localhost:8574,localhost:9900 -DnumShards=2 -jar start.jar I can't figure out from the documentation how/where I set the above properties when deploying Solr as a war file. I initially thought these might be configurable through solr.xml but can't find anything in the documentation to support this. Most grateful for any pointers here. Cheers, Edd -- Web: http://www.eddgrant.com Email: e...@eddgrant.com Mobile: +44 (0) 7861 394 543
Re: Average Solr Server Spec.
Hi Walter; Could you tell me the average size of your Solr indexes and the average queries per second your Solr handles? Maybe I can come up with an assumption from that. 2013/4/9 Walter Underwood wun...@wunderwood.org We mostly run m1.xlarge with an 8GB heap. --wunder On Apr 9, 2013, at 10:57 AM, Otis Gospodnetic wrote: Hi, You are right, there is no average. I saw a Solr cluster with a few EC2 micro instances yesterday, and regularly see Solr running on 16 or 32 GB RAM and sometimes well over 100 GB RAM. Sometimes they have just 2 CPU cores, sometimes 32 or more. Some use SSDs, some HDDs, some local storage, some SAN, some EBS on AWS, etc. Otis -- Solr ElasticSearch Support http://sematext.com/ On Tue, Apr 9, 2013 at 7:04 AM, Furkan KAMACI furkankam...@gmail.com wrote: This question may not have a general answer and may be open-ended, but is there any commodity server spec for a typical machine running Solr? I mean, what is the average server specification for a Solr machine? (For example, for a system running Hadoop it is not recommended to have machines with very large storage capacity.) I will use Solr for indexing web-crawled data.
Re: Slow qTime for distributed search
Hi Shawn; You say that: *... your documents are about 50KB each. That would translate to an index that's at least 25GB.* I know we cannot give an exact number, but in your experience, what is the approximate ratio of index size to document size? 2013/4/9 Shawn Heisey s...@elyograg.org On 4/9/2013 2:10 PM, Manuel Le Normand wrote: Thanks for replying. My config: - 40 dedicated servers, dual-core each - Running the Tomcat servlet container on Linux - 12 GB RAM per server, split half between OS and Solr - Complex queries (up to 30 conditions on different fields), 1 qps rate. Sharding my index was done for two reasons, based on tests with 2 servers (4 shards): 1. As the index grew above a few million docs, qTime rose greatly, while sharding the index into smaller pieces (about 0.5M docs) gave far better results, so I bound every shard to 0.5M docs. 2. Tests showed I was CPU-bound during queries. As I have a low qps rate (emphasis: lower than expected qTime) and as a query runs single-threaded on each shard, it made sense to assign a CPU to each shard. For the same number of docs per shard I do expect a rise in total qTime, for two reasons: 1. The response must wait for the slowest shard. 2. Merging the responses from 40 different shards takes time. What I understand from your explanation is that it's the merging that takes time, and as qTime ends only after the second retrieval phase, the qTime on each shard will be longer. Meaning that during a significant proportion of the first query phase (right after the [id,score] pairs are retrieved), all CPUs are idle except the response-merger thread running on a single CPU. I thought of the merge as a simple sort of [id,score], far simpler than an additional 300 ms of CPU time. Why would a RAM increase improve my performance, if the bottleneck is the response merge (a CPU resource)?
If you have not tweaked the Tomcat configuration, that can lead to problems, but if your total query volume is really only one query per second, this is probably not a worry for you. A Tomcat connector can be configured with a maxThreads parameter. The recommended value there is 10000, but Tomcat defaults to 200. You didn't include the index sizes. There's half a million docs per shard, but I don't know what that translates to in terms of MB or GB of disk space. On another email thread you mention that your documents are about 50KB each. That would translate to an index that's at least 25GB, possibly more. That email thread also says that optimization for you takes an hour, a further indication that you've got some really big indexes. You're saying that you have given 6GB out of the 12GB to Solr, leaving only 6GB for the OS and caching. Ideally you want to have enough RAM to cache the entire index, but in reality you can usually get away with caching between half and two thirds of the index. Exactly what ratio works best is highly dependent on your schema. If my numbers are even close to right, then you've got a lot more index on each server than available RAM. Based on what I can deduce, you would want 24 to 48GB of RAM per server. If my numbers are wrong, then this estimate is wrong. I would be interested in seeing your queries. If the complexity can be expressed as filter queries that get re-used a lot, the filter cache can be a major boost to performance. Solr's caches in general can make a big difference. There is no guarantee that caches will help, of course. Thanks, Shawn
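Shawn's deduction above can be reproduced with simple arithmetic. The 0.5M docs per shard, ~50 KB per doc, and "cache half to two thirds of the index" rule of thumb come from the thread; the two-shards-per-server figure comes from Manuel's later message, and everything else is illustrative:

```python
# Sketch of the RAM deduction in the reply above, not a Solr formula.

def shard_index_gb(docs_per_shard, avg_doc_kb):
    """Index size if the index roughly matches the raw document size."""
    return docs_per_shard * avg_doc_kb / 1024 ** 2  # KB -> GB

per_shard = shard_index_gb(500_000, 50)   # ~24 GB, in line with "at least 25GB"
per_server = per_shard * 2                # two shards per server -> ~48 GB

cache_low = per_server * 0.5              # cache half the index
cache_high = per_server * 1.0             # cache the whole index

print(f"{per_shard:.1f} GB/shard; want {cache_low:.0f}-{cache_high:.0f} GB "
      "of RAM per server for the OS cache alone")
```

This matches the "24 to 48GB of RAM per server" range Shawn deduces, before adding the Solr heap itself.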
Approximately needed RAM for 5000 query/second at a Solr machine?
Is there anybody who can help me guess the approximate RAM needed for 5000 queries/second on a Solr machine?
Re: Approximately needed RAM for 5000 query/second at a Solr machine?
Actually, I will propose a system and I need to figure out the machine specifications. There will be no faceting mechanism at first, just the simple search queries of a web search engine. We can assume that I will have a commodity server (I don't know whether there is any benchmark for a typical Solr machine). 2013/4/10 Jack Krupansky j...@basetechnology.com It all depends on the nature of your query and the nature of the data in the index. Does returning results from a result cache count in your QPS? Not to mention how many cores and CPU speed and CPU caching as well. Not to mention network latency. The best way to answer is to do a proof-of-concept implementation and measure it yourself. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Tuesday, April 09, 2013 6:06 PM To: solr-user@lucene.apache.org Subject: Approximately needed RAM for 5000 query/second at a Solr machine? Is there anybody who can help me guess the approximate RAM needed for 5000 queries/second on a Solr machine?
Re: Approximately needed RAM for 5000 query/second at a Solr machine?
Hi Walter; Firstly, thanks for your detailed reply. I know that this is not a well-detailed question, but I don't have any metrics yet. If we talk about your system, what is the average RAM size of your Solr machines? Maybe that can help me make a comparison. 2013/4/10 Walter Underwood wun...@wunderwood.org On Apr 9, 2013, at 3:06 PM, Furkan KAMACI wrote: Is there anybody who can help me guess the approximate RAM needed for 5000 queries/second on a Solr machine? No. That depends on the kind of queries you have, the size and content of the index, the required response time, how frequently the index is updated, and many more factors. So anyone who can guess that is wrong. You can only find that out by running your own benchmarks with your own queries against your own index. In our system, we can meet our response time requirements at a rate of 4000 queries/minute. We have several cores, but most traffic goes to a 3M document index. This index is small documents, mostly titles and authors of books. We have no wildcard queries and less than 5% of our queries use fuzzy matching. We update once per day and have cache hit rates of around 30%. We run new benchmarks twice each year, before our busy seasons. We use the current index and configuration and the queries from the busiest day of the previous season. Our key benchmark is the 95th percentile response time, but we also measure median, 90th, and 99th percentile. We are currently on Solr 3.3 with some customizations. We're working on transitioning to Solr 4. wunder -- Walter Underwood wun...@wunderwood.org
Re: Approximately needed RAM for 5000 query/second at a Solr machine?
Thanks for your answer. 2013/4/10 Walter Underwood wun...@wunderwood.org We are using Amazon EC2 M1 Extra Large instances (m1.xlarge). http://aws.amazon.com/ec2/instance-types/ wunder On Apr 9, 2013, at 3:35 PM, Furkan KAMACI wrote: Hi Walter; Firstly thank for your detailed reply. I know that this is not a well detailed question but I don't have any metrics yet. If we talk about your system, what is the average RAM size of your Solr machines? Maybe that can help me to make a comparison. 2013/4/10 Walter Underwood wun...@wunderwood.org On Apr 9, 2013, at 3:06 PM, Furkan KAMACI wrote: Are there anybody who can help me about how to guess the approximately needed RAM for 5000 query/second at a Solr machine? No. That depends on the kind of queries you have, the size and content of the index, the required response time, how frequently the index is updated, and many more factors. So anyone who can guess that is wrong. You can only find that out by running your own benchmarks with your own queries against your own index. In our system, we can meet our response time requirements at a rate of 4000 queries/minute. We have several cores, but most traffic goes to a 3M document index. This index is small documents, mostly titles and authors of books. We have no wildcard queries and less than 5% of our queries use fuzzy matching. We update once per day and have cache hit rates of around 30%. We run new benchmarks twice each year, before our busy seasons. We use the current index and configuration and the queries from the busiest day of the previous season. Our key benchmark is the 95th percentile response time, but we also measure median, 90th, and 99th percentile. We are currently on Solr 3.3 with some customizations. We're working on transitioning to Solr 4. wunder -- Walter Underwood wun...@wunderwood.org -- Walter Underwood wun...@wunderwood.org
Re: Pushing a whole set of pdf-files to solr
The Apache Solr 4 Cookbook says: curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F "myfile=@cookbook.pdf" Is that what you want? 2013/4/10 sdspieg sdsp...@mail.ru If anybody could still help me out with this, I'd really appreciate it. Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Pushing-a-whole-set-of-pdf-files-to-solr-tp4025256p4054885.html Sent from the Solr - User mailing list archive at Nabble.com.
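Since the original question was about a whole set of PDFs, here is a minimal sketch that generates one extract request per file in a directory. The endpoint URL, the use of the file stem as the unique id, and committing only once at the end are all assumptions to adapt to your own setup:

```python
# Build curl commands for Solr's ExtractingRequestHandler, one per PDF.
# The URL and id scheme are assumptions; commit=true only on the last file.
import shlex
from pathlib import Path

SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"  # assumed default

def build_extract_command(pdf_path, doc_id, commit=False):
    url = (f"{SOLR_EXTRACT_URL}?literal.id={doc_id}"
           f"&commit={'true' if commit else 'false'}")
    return f"curl {shlex.quote(url)} -F myfile=@{shlex.quote(str(pdf_path))}"

def commands_for_directory(directory):
    pdfs = sorted(Path(directory).glob("*.pdf"))
    # Commit once at the end instead of per file; it is much cheaper.
    return [build_extract_command(p, p.stem, commit=(i == len(pdfs) - 1))
            for i, p in enumerate(pdfs)]

print(build_extract_command("cookbook.pdf", "1", commit=True))
```

You could equally run the same loop in a shell script; the point is simply to reuse the cookbook's single-file command across the whole set.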
Re: Approximately needed RAM for 5000 query/second at a Solr machine?
These are really good metrics for me. You say that RAM size should be at least the index size, and it is better to have RAM twice the index size (because of the worst-case scenario). On the other hand, let's assume that I have more RAM than twice the index size on the machine. Can Solr use that extra RAM, or is twice the index size an approximate upper limit? 2013/4/10 Shawn Heisey s...@elyograg.org On 4/9/2013 4:06 PM, Furkan KAMACI wrote: Is there anybody who can help me guess the approximate RAM needed for 5000 queries/second on a Solr machine? You've already gotten some good replies, and I'm aware that they haven't really answered your question. This is the kind of question that cannot be answered. The amount of RAM that you'll need for extreme performance actually isn't hard to figure out - you need enough free RAM for the OS to cache the maximum amount of disk space all your indexes will ever use. Normally this will be twice the size of all the indexes on the machine, because that's how much disk space will likely be used in a worst-case merge scenario (optimize). That's very expensive, so it is cheaper to budget for only the size of the index. A load of 5000 queries per second is pretty high, and probably something you will not achieve with a single-server (not counting backup) approach. All of the tricks that high-volume website developers use are also applicable to Solr. Once you have enough RAM, you need to worry more about the number of servers, the number of CPU cores in each server, and the speed of those CPU cores. Testing with actual production queries is the only way to find out what you really need. Beyond hardware design, making the requests as simple as possible and taking advantage of caches is important. Solr has caches for queries, filters, and documents.
You can also put a caching proxy (something like Varnish) in front of Solr, but that would make NRT updates pretty much impossible, and that kind of caching can be difficult to get working right. Thanks, Shawn
Re: Approximately needed RAM for 5000 query/second at a Solr machine?
I am sorry, but you said: *you need enough free RAM for the OS to cache the maximum amount of disk space all your indexes will ever use*. Let me make an assumption about the indexes on my machine: let's say they total 5 GB. So is it better to have at least 5 GB of RAM? OK, Solr will use RAM up to whatever I define for its Java process. When we think about the indexes on storage and the OS caching them in RAM, is this what you mean: having more than 5 GB - or 10 GB - of RAM for my machine? 2013/4/10 Shawn Heisey s...@elyograg.org On 4/9/2013 7:03 PM, Furkan KAMACI wrote: These are really good metrics for me. You say that RAM size should be at least the index size, and it is better to have RAM twice the index size (because of the worst-case scenario). On the other hand, let's assume that I have more RAM than twice the index size on the machine. Can Solr use that extra RAM, or is twice the index size an approximate upper limit? What we have been discussing is the OS cache, which is memory that is not used by programs. The OS uses that memory to make everything run faster. The OS will instantly give that memory up if a program requests it. Solr is a Java program, and Java uses memory a little differently, so Solr most likely will NOT use more memory when it is available. In a normal directly executable program, memory can be allocated at any time, and given back to the system at any time. With Java, you tell it the maximum amount of memory the program is ever allowed to use. Because of how memory is used inside Java, most long-running Java programs (like Solr) will allocate up to the configured maximum even if they don't really need that much memory. Most Java virtual machines will never give the memory back to the system even if it is not required. Thanks, Shawn
Re: Approximately needed RAM for 5000 query/second at a Solr machine?
Thank you for your explanations, this will help me to figure out my system. 2013/4/10 Shawn Heisey s...@elyograg.org On 4/9/2013 9:12 PM, Furkan KAMACI wrote: I am sorry but you said: *you need enough free RAM for the OS to cache the maximum amount of disk space all your indexes will ever use* I have made an assumption my indexes at my machine. Let's assume that it is 5 GB. So it is better to have at least 5 GB RAM? OK, Solr will use RAM up to how much I define it as a Java processes. When we think about the indexes at storage and caching them at RAM by OS, is that what you talk about: having more than 5 GB - or - 10 GB RAM for my machine? If your index is 5GB, and you give 3GB of RAM to the Solr JVM, then you would want at least 8GB of total RAM for that machine - the 3GB of RAM given to Solr, plus the rest so the OS can cache the index in RAM. If you plan for double the cache memory, you'd need 13 to 14GB. Thanks, Shawn
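Shawn's sizing rule in the reply above is plain addition: total RAM is the Solr heap plus enough free memory for the OS to cache the index (once for the minimum, twice to cover worst-case merge disk usage). A minimal sketch using the thread's own 5 GB index / 3 GB heap figures:

```python
# Sizing arithmetic from the reply above; the function is illustrative,
# not a Solr API.

def total_ram_gb(index_gb, heap_gb, cache_factor=1.0):
    """RAM = Solr heap + index size times the desired OS-cache factor."""
    return heap_gb + index_gb * cache_factor

minimum = total_ram_gb(5, 3)            # 8 GB: heap + one full index copy
worst_case = total_ram_gb(5, 3, 2.0)    # 13 GB: heap + double cache budget

print(f"minimum {minimum} GB, worst-case budget {worst_case} GB")
```

The 13 GB result lands at the bottom of the "13 to 14GB" range Shawn gives; the extra GB is simply headroom.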
Re: Approximately needed RAM for 5000 query/second at a Solr machine?
Hi Walter; Is there any document or anything else that says the worst case is three times the disk space? Two times or three times? It makes a real difference when we are talking about GBs of disk space. 2013/4/10 Walter Underwood wun...@wunderwood.org Correct, except the worst case maximum for disk space is three times. --wunder On Apr 10, 2013, at 6:04 AM, Erick Erickson wrote: You're mixing up disk and RAM requirements when you talk about having twice the disk size. Solr does _NOT_ require twice the index size of RAM to optimize, it requires twice the size on _DISK_. In terms of RAM requirements, you need to create an index, run realistic queries at the installation and measure. Best Erick On Tue, Apr 9, 2013 at 10:32 PM, bigjust bigj...@lambdaphil.es wrote: On 4/9/2013 7:03 PM, Furkan KAMACI wrote: These are really good metrics for me: You say that RAM size should be at least index size, and it is better to have a RAM size twice the index size (because of worst case scenario). On the other hand let's assume that I have a RAM size that is bigger than twice of indexes at machine. Can Solr use that extra RAM or is it a approximately maximum limit (to have twice size of indexes at machine)? What we have been discussing is the OS cache, which is memory that is not used by programs. The OS uses that memory to make everything run faster. The OS will instantly give that memory up if a program requests it. Solr is a java program, and java uses memory a little differently, so Solr most likely will NOT use more memory when it is available. In a normal directly executable program, memory can be allocated at any time, and given back to the system at any time. With Java, you tell it the maximum amount of memory the program is ever allowed to use. Because of how memory is used inside Java, most long-running Java programs (like Solr) will allocate up to the configured maximum even if they don't really need that much memory.
Most Java virtual machines will never give the memory back to the system even if it is not required. Thanks, Shawn Furkan KAMACI furkankam...@gmail.com writes: I am sorry but you said: *you need enough free RAM for the OS to cache the maximum amount of disk space all your indexes will ever use* I have made an assumption my indexes at my machine. Let's assume that it is 5 GB. So it is better to have at least 5 GB RAM? OK, Solr will use RAM up to how much I define it as a Java processes. When we think about the indexes at storage and caching them at RAM by OS, is that what you talk about: having more than 5 GB - or - 10 GB RAM for my machine? 2013/4/10 Shawn Heisey s...@elyograg.org 10 GB. Because when Solr shuffles the data around, it could use up to twice the size of the index in order to optimize the index on disk. -- Justin -- Walter Underwood wun...@wunderwood.org
Re: migration solr 3.5 to 4.1 - JVM GC problems
Hi Marc; Could you tell me your index size and your performance measured in queries per second? 2013/4/11 Marc Des Garets marc.desgar...@192.com Big heap because of a very large number of requests, with more than 60 indexes and hundreds of millions of documents (all indexes together). My problem is with Solr 4.1. All is perfect with 3.5. I have 0.05 sec GCs every 1 or 2 min and 20Gb of the heap is used. With the 4.1 indexes it uses 30Gb-33Gb, the survivor space is all weird (it changed the size capacity to 6Mb at some point) and I have 2 sec GCs every minute. There must be something that has changed in 4.1 compared to 3.5 to cause this behavior. It's the same requests, same schemas (except 4 fields changed from sint to tint) and same config. On 04/10/2013 07:38 PM, Shawn Heisey wrote: On 4/10/2013 9:48 AM, Marc Des Garets wrote: The JVM behavior is now radically different and doesn't seem to make sense. I was using ConcMarkSweepGC. I am now trying the G1 collector. The perm gen went from 410Mb to 600Mb. The eden space usage is a lot bigger and the survivor space usage is 100% all the time. I don't really understand what is happening. GC behavior really doesn't seem right. My JVM settings: -d64 -server -Xms40g -Xmx40g -XX:+UseG1GC -XX:NewRatio=1 -XX:SurvivorRatio=3 -XX:PermSize=728m -XX:MaxPermSize=728m As Otis has already asked, why do you have a 40GB heap? The only way I can imagine that you would actually NEED a heap that big is if your index size is measured in hundreds of gigabytes. If you really do need a heap that big, you will probably need to go with a JVM like Zing. I don't know how much Zing costs, but they claim to be able to make any heap size perform well under any load. It is Linux-only. I was running into extreme problems with GC pauses with my own setup, and that was only with an 8GB heap. I was using the CMS collector and NewRatio=1. Switching to G1 didn't help at all - it might have even made the problem worse. I never did try the Zing JVM.
After a lot of experimentation (which I will admit was not done very methodically) I found JVM options that have reduced the GC pause problem greatly. Below is what I am using now on Solr 4.2.1 with a total per-server index size of about 45GB. This works properly on CentOS 6 with Oracle Java 7u17; UseLargePages may require special kernel tuning on other operating systems: -Xmx6144M -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:NewRatio=3 -XX:MaxTenuringThreshold=8 -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:+UseLargePages -XX:+AggressiveOpts These options could probably use further tuning, but I haven't had time for the kind of testing that will be required. If you decide to pay someone to make the problem go away instead: http://www.azulsystems.com/products/zing/whatisit Thanks, Shawn
Re: Pointing to Hbase for Docuements or Directly Saving Documents at Hbase
Actually, I don't plan to store documents in Solr. I want to store just the highlights (snippets) in HBase and retrieve them from HBase when needed. What do you think about separating just the highlights from Solr and storing them in HBase with SolrCloud? By the way, I would welcome an explanation of at which stage, and how, highlights are generated in Solr. 2013/4/9 Otis Gospodnetic otis.gospodne...@gmail.com You may also be interested in looking at things like solrbase (on Github). Otis -- Solr ElasticSearch Support http://sematext.com/ On Sat, Apr 6, 2013 at 6:01 PM, Furkan KAMACI furkankam...@gmail.com wrote: Hi; First of all, I should mention that I am new to Solr and doing research on it. What I am trying to do: I will crawl some websites with Nutch and then index them with Solr (Nutch 2.1, Solr/SolrCloud 4.2). I wonder about something. I have a cloud of machines that crawls websites and stores the documents. Then I send those documents into SolrCloud. Solr indexes the documents, generates indexes, and saves them. I know from Information Retrieval theory that it *may* not be efficient to store indexes in a NoSQL database (they are something like linked lists, and if you store them in such a database you *may* get a sparse representation - by the way, there may be some solutions for this; if you can explain them, you are welcome to). However, Solr stores some documents too (e.g. for highlights), so some of my documents will be duplicated somehow. Considering that I will have many documents, those duplicated documents may cause a problem for me. So is there any way to avoid storing those documents in Solr and instead point to them in HBase (where I save my crawled documents), or to store them directly in HBase (is that efficient or not)?
Re: Approximately needed RAM for 5000 query/second at a Solr machine?
Thanks Walter, you guys gave me really nice ideas about RAM approximation. 2013/4/11 Walter Underwood wun...@wunderwood.org Here is the situation where merging can require 3X space. It can only happen if you force merge, then index with merging turned off, but we had Ultraseek customers do that. * All documents are merged into a single segment. * Without a merge, all documents are replaced. * This results in one segment of deleted documents and one of new documents (2X). * A merge takes place, creating a new segment of the same size, thus 3X. For normal operation, 2X is plenty of room. wunder On Apr 11, 2013, at 6:46 AM, Michael Ryan wrote: I've investigated this in the past. The worst case is 2*indexSize additional disk space (3*indexSize total) during an optimize. In our system, we use LogByteSizeMergePolicy, and used to have a mergeFactor of 10. We would see the worst case happen when there were exactly 20 segments (or some other multiple of 10, I believe) at the start of the optimize. IIRC, it would merge those 20 segments down to 2 segments, and then merge those 2 segments down to 1 segment. 1*indexSize space was used by the original index (because there is still a reader open on it), 1*indexSpace was used by the 2 segments, and 1*indexSize space was used by the 1 segment. This is the worst case because there are two full additional copies of the index on disk. Normally, when the number of segments is not a multiple of the mergeFactor, there will be some part of the index that was not part of both merges (and this part that is excluded usually would be the largest segments). We worked around this by doing multiple optimize passes, where the first pass merges down to between 2 and 2*mergeFactor-1 segments (based on a great tip from Lance Norskog on the mailing list a couple years ago). I'm not sure if the current merge policy implementations still have this issue. 
-Michael -Original Message- From: Furkan KAMACI [mailto:furkankam...@gmail.com] Sent: Thursday, April 11, 2013 2:44 AM To: solr-user@lucene.apache.org Subject: Re: Approximately needed RAM for 5000 query/second at a Solr machine? Hi Walter; Is there any document or something else says that worst case is three times of disk space? Twice times or three times. It is really different when we talk about GB's of disk spaces. 2013/4/10 Walter Underwood wun...@wunderwood.org Correct, except the worst case maximum for disk space is three times. --wunder On Apr 10, 2013, at 6:04 AM, Erick Erickson wrote: You're mixing up disk and RAM requirements when you talk about having twice the disk size. Solr does _NOT_ require twice the index size of RAM to optimize, it requires twice the size on _DISK_. In terms of RAM requirements, you need to create an index, run realistic queries at the installation and measure. Best Erick On Tue, Apr 9, 2013 at 10:32 PM, bigjust bigj...@lambdaphil.es wrote: On 4/9/2013 7:03 PM, Furkan KAMACI wrote: These are really good metrics for me: You say that RAM size should be at least index size, and it is better to have a RAM size twice the index size (because of worst case scenario). On the other hand let's assume that I have a RAM size that is bigger than twice of indexes at machine. Can Solr use that extra RAM or is it a approximately maximum limit (to have twice size of indexes at machine)? What we have been discussing is the OS cache, which is memory that is not used by programs. The OS uses that memory to make everything run faster. The OS will instantly give that memory up if a program requests it. Solr is a java program, and java uses memory a little differently, so Solr most likely will NOT use more memory when it is available. In a normal directly executable program, memory can be allocated at any time, and given back to the system at any time. With Java, you tell it the maximum amount of memory the program is ever allowed to use. 
Because of how memory is used inside Java, most long-running Java programs (like Solr) will allocate up to the configured maximum even if they don't really need that much memory. Most Java virtual machines will never give the memory back to the system even if it is not required. Thanks, Shawn Furkan KAMACI furkankam...@gmail.com writes: I am sorry but you said: *you need enough free RAM for the OS to cache the maximum amount of disk space all your indexes will ever use* I have made an assumption my indexes at my machine. Let's assume that it is 5 GB. So it is better to have at least 5 GB RAM? OK, Solr will use RAM up to how much I define it as a Java processes. When we think about the indexes at storage and caching them at RAM by OS, is that what you talk about: having more than 5 GB - or - 10 GB RAM for my machine? 2013/4/10 Shawn Heisey s...@elyograg.org 10 GB. Because when
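The 2x-vs-3x discussion above reduces to a simple accounting identity. This is only a toy model of the scenarios Walter and Michael describe, not a statement about any particular merge policy: peak disk usage is the live index (held open by a reader) plus however many additional full-size copies the merge passes create at once:

```python
# Toy disk accounting for an optimize. extra_copies = 1 models the normal
# case (old index + one merged copy -> 2x); extra_copies = 2 models the
# worst case in this thread (old index + intermediate merge pass + final
# single segment all on disk at once -> 3x).

def peak_disk_gb(index_gb, extra_copies):
    return index_gb * (1 + extra_copies)

normal = peak_disk_gb(10, 1)   # 2x: 20 GB peak for a 10 GB index
worst = peak_disk_gb(10, 2)    # 3x: 30 GB peak

print(f"normal optimize: {normal} GB, worst case: {worst} GB")
```

Which case you hit depends on the merge policy and segment layout, which is exactly why the thread recommends budgeting disk conservatively.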
Re: Pointing to Hbase for Docuements or Directly Saving Documents at Hbase
Hi Otis; It seems that I should read more about highlighting. Is there anywhere that explains in detail how highlights are generated in Solr? 2013/4/11 Otis Gospodnetic otis.gospodne...@gmail.com Hi, You can't store highlights ahead of time because they are query dependent. You could store documents in HBase and use Solr just for indexing. Is that what you want to do? If so, a custom SearchComponent executed after QueryComponent could fetch data from an external store like HBase. I'm not sure if I'd recommend that. Otis -- Solr ElasticSearch Support http://sematext.com/ On Thu, Apr 11, 2013 at 10:01 AM, Furkan KAMACI furkankam...@gmail.com wrote: Actually I don't plan to store documents in Solr. I want to store just the highlights (snippets) in HBase and retrieve them from HBase when needed. What do you think about separating just the highlights from Solr and storing them in HBase with SolrCloud? By the way, if you could explain at which stage and how highlights are generated in Solr, I would appreciate it. 2013/4/9 Otis Gospodnetic otis.gospodne...@gmail.com You may also be interested in looking at things like solrbase (on Github). Otis -- Solr ElasticSearch Support http://sematext.com/ On Sat, Apr 6, 2013 at 6:01 PM, Furkan KAMACI furkankam...@gmail.com wrote: Hi; First of all I should mention that I am new to Solr and doing research about it. What I am trying to do is crawl some websites with Nutch and then index them with Solr (Nutch 2.1, Solr/SolrCloud 4.2). I wonder about something. I have a cloud of machines that crawls websites and stores those documents. Then I send those documents into SolrCloud. Solr indexes the documents, generates the indexes, and saves them. I know from Information Retrieval theory that it *may* not be efficient to store indexes in a NoSQL database (they are something like linked lists, and if you store them in such a database you *may* get a sparse representation -- by the way, there may be some solutions for this; if you can explain them, I would appreciate it.)
However, Solr stores some documents too (e.g. for highlighting), so some of my documents will be duplicated somehow. Considering that I will have many documents, those duplicated documents may become a problem for me. So is there any way to avoid storing those documents in Solr and instead point to them in HBase (where I save my crawled documents), or to store them directly in HBase (is that efficient or not)?
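The two-step pattern Otis describes -- Solr returns only the matching ids, and the full documents are hydrated from the external store -- can be sketched as follows. This is an illustrative sketch only: the in-memory map stands in for the HBase table and the hard-coded id list stands in for a Solr response requested with `fl=id`; all names here are hypothetical.

```java
import java.util.*;

public class ExternalStoreFetch {
    // Stand-in for the HBase table keyed by document id (hypothetical data).
    static final Map<String, String> externalStore = Map.of(
            "doc1", "full text of document 1",
            "doc2", "full text of document 2",
            "doc3", "full text of document 3");

    // Solr is asked only for the matching ids (e.g. fl=id); this ranked
    // list stands in for that response.
    static List<String> searchSolrForIds() {
        return List.of("doc3", "doc1");
    }

    public static void main(String[] args) {
        // Step 1: get ranked ids from Solr.
        // Step 2: hydrate the full documents from the external store.
        List<String> docs = new ArrayList<>();
        for (String id : searchSolrForIds()) {
            docs.add(externalStore.get(id));
        }
        System.out.println(docs);
    }
}
```

In a real deployment the second step would be an HBase get per id (or a batch get), either in the client or inside a custom SearchComponent as suggested above.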
Re: Slow qTime for distributed search
Manuel Le Normand, I am sorry but I want to learn something. You said you have 40 dedicated servers. What is your total document count, total document size, and total shard size? 2013/4/11 Manuel Le Normand manuel.lenorm...@gmail.com Hi, We have different working hours, sorry for the reply delay. Your assumed numbers are right, about 25-30Kb per doc, giving a total of 15G per shard; there are two shards per server (+2 slaves that should do no work normally). An average query has about 30 conditions (OR and AND mixed), most of them textual, a small part on dateTime. They use only simple queries (no facets, filters etc.) as it is taken from the actual query set of my enterprise, which works with an old search engine. As we said, if the shards in collection1 and collection2 each have the same number of docs (and the same RAM and CPU per shard), it is apparently not a slow IO issue, right? So the fact of not having cached all my index doesn't seem to be the bottleneck. Moreover, I do store the fields, but my query set requests only the ids and rarely snippets, so I'd assume that the plenty of RAM I'd give the OS wouldn't make any difference, as these *.fdt files don't need to get cached. The conclusion I come to is that the merging issue is the problem, and the only possibility of outsmarting it is to distribute to far fewer shards, meaning I'd get back to a few million docs per shard, which is roughly linearly slower with the number of docs per shard. Though the latter should improve if I give much more RAM per server. I'll try tweaking my schema a bit and making better use of Solr caches (filter queries, for example), but something tells me the problem might be elsewhere.
My main clue to it is that merging seems a simple CPU task, and tests show that even with a small amount of responses it takes a long time (and clearly the merging task on few docs is very short) On Wed, Apr 10, 2013 at 2:50 AM, Shawn Heisey s...@elyograg.org wrote: On 4/9/2013 3:50 PM, Furkan KAMACI wrote: Hi Shawn; You say that: *... your documents are about 50KB each. That would translate to an index that's at least 25GB* I know we can not say an exact size but what is the approximately ratio of document size / index size according to your experiences? If you store the fields, that is actual size plus a small amount of overhead. Starting with Solr 4.1, stored fields are compressed. I believe that it uses LZ4 compression. Some people store all fields, some people store only a few or one - an ID field. The size of stored fields does have an impact on how much OS disk cache you need, but not as much as the other parts of an index. It's been my experience that termvectors take up almost as much space as stored data for the same fields, and sometimes more. Starting with Solr 4.2, termvectors are also compressed. Adding docValues (new in 4.2) to the schema will also make the index larger. The requirements here are similar to stored fields. I do not know whether this data gets compressed, but I don't think it does. As for the indexed data, this is where I am less clear about the storage ratios, but I think you can count on it needing almost as much space as the original data. If the schema uses types or filters that produce a lot of information, the indexed data might be larger than the original input. Examples of data explosions in a schema: trie fields with a non-zero precisionStep, the edgengram filter, the shingle filter. Thanks, Shawn
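The merge step discussed above -- the coordinating node combining each shard's already-ranked top hits into a single global top-k -- can be sketched as a heap-based merge. This is a conceptual illustration under assumed names, not Solr's actual implementation:

```java
import java.util.*;

public class ShardMerge {
    record Hit(String id, float score) {}

    // Merge each shard's already-sorted top hits into a global top-k,
    // as the coordinating node does conceptually in a distributed search.
    static List<Hit> mergeTopK(List<List<Hit>> perShard, int k) {
        PriorityQueue<Hit> heap = new PriorityQueue<>(
                Comparator.comparingDouble(Hit::score).reversed());
        for (List<Hit> shard : perShard) heap.addAll(shard);
        List<Hit> out = new ArrayList<>();
        for (int i = 0; i < k && !heap.isEmpty(); i++) out.add(heap.poll());
        return out;
    }

    public static void main(String[] args) {
        List<List<Hit>> shards = List.of(
                List.of(new Hit("a", 9.0f), new Hit("b", 4.0f)),
                List.of(new Hit("c", 7.0f), new Hit("d", 1.0f)));
        for (Hit h : mergeTopK(shards, 3)) System.out.println(h.id() + " " + h.score());
    }
}
```

The merge itself is cheap over a few hundred candidates, which is why a slow "merge" phase usually points at something else, such as waiting on the slowest shard's response.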
Re: Approximately needed RAM for 5000 query/second at a Solr machine?
Hi Jack; Since I am new to Solr, can you explain these two things that you said: 1) when most people say index size they are referring to all fields, collectively, not individual fields (what do you mean by segments being on a per-field basis, and by all fields vs. individual fields?) 2) more cores might make the worst case scenario worse, since it will maximize the amount of data processed at a given moment. 2013/4/13 Erick Erickson erickerick...@gmail.com bq: disk space is three times True, I keep forgetting about compound since I never use it... On Wed, Apr 10, 2013 at 11:05 AM, Walter Underwood wun...@wunderwood.org wrote: Correct, except the worst case maximum for disk space is three times. --wunder On Apr 10, 2013, at 6:04 AM, Erick Erickson wrote: You're mixing up disk and RAM requirements when you talk about having twice the disk size. Solr does _NOT_ require twice the index size of RAM to optimize, it requires twice the size on _DISK_. In terms of RAM requirements, you need to create an index, run realistic queries at the installation and measure. Best Erick On Tue, Apr 9, 2013 at 10:32 PM, bigjust bigj...@lambdaphil.es wrote: On 4/9/2013 7:03 PM, Furkan KAMACI wrote: These are really good metrics for me: You say that RAM size should be at least the index size, and it is better to have a RAM size twice the index size (because of the worst case scenario). On the other hand, let's assume that I have a RAM size that is bigger than twice my indexes on a machine. Can Solr use that extra RAM, or is it approximately a maximum limit (to have twice the size of the indexes on the machine)? What we have been discussing is the OS cache, which is memory that is not used by programs. The OS uses that memory to make everything run faster. The OS will instantly give that memory up if a program requests it. Solr is a java program, and java uses memory a little differently, so Solr most likely will NOT use more memory when it is available.
In a normal directly executable program, memory can be allocated at any time, and given back to the system at any time. With Java, you tell it the maximum amount of memory the program is ever allowed to use. Because of how memory is used inside Java, most long-running Java programs (like Solr) will allocate up to the configured maximum even if they don't really need that much memory. Most Java virtual machines will never give the memory back to the system even if it is not required. Thanks, Shawn Furkan KAMACI furkankam...@gmail.com writes: I am sorry but you said: *you need enough free RAM for the OS to cache the maximum amount of disk space all your indexes will ever use* I have made an assumption my indexes at my machine. Let's assume that it is 5 GB. So it is better to have at least 5 GB RAM? OK, Solr will use RAM up to how much I define it as a Java processes. When we think about the indexes at storage and caching them at RAM by OS, is that what you talk about: having more than 5 GB - or - 10 GB RAM for my machine? 2013/4/10 Shawn Heisey s...@elyograg.org 10 GB. Because when Solr shuffles the data around, it could use up to twice the size of the index in order to optimize the index on disk. -- Justin -- Walter Underwood wun...@wunderwood.org
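The "configured maximum" Shawn describes is the JVM's -Xmx option. A hedged example of a launch line for a Solr 4.x Jetty setup (the 4 GB figure is purely illustrative; the right value depends on your index, and whatever RAM you do not give to the heap stays available for the OS disk cache):

```
# Illustrative values only: cap the Solr JVM heap at 4 GB.
# RAM not claimed by the heap stays free for the OS to cache index files.
java -Xms4g -Xmx4g -jar start.jar
```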
Listing Priority
I have crawled some internet pages and indexed them in Solr. When I list my results from Solr, I want pages whose URL (my schema includes a field for the URL) ends with .edu, .edu.az, or .co.uk to be given higher priority. What is the most efficient way to do this in Solr?
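One way to do this (a sketch under assumptions: the `edismax` query parser, and a hypothetical `url_suffix` string field that you populate at index time with the URL's registered ending, e.g. `edu` or `co.uk`) is a boost query, which raises the score of matching documents without filtering the others out:

```
defType=edismax
q=<user query>
bq=url_suffix:"edu"^3 OR url_suffix:"edu.az"^3 OR url_suffix:"co.uk"^2
```

The boost factors are illustrative only and would need tuning; index-time boosting or a function query on the same field are alternative approaches.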
Some Questions About Using Solr as Cloud
I have read the wiki and am reading the Lucidworks Solr Guide. However, I want to clear some things up in my mind. Here are my questions: 1) Does SolrCloud allow a multi-master design (is there any document I can read about it)? 2) Let's assume I use multiple cores, e.g. core A and core B, and that a document has just been indexed at core B. If I send a search request to core A, can I get a result? 3) When I use a multi-master design (if it exists), can I transfer one master's index data into another (with its slaves or not)? 4) When I use a multi-core design, can I transfer index data into another core or anywhere else? By the way, thanks for the quick responses and kindness on the mailing list.
Re: Some Questions About Using Solr as Cloud
5) When I use a multi-core design, can I transfer index data into another core or anywhere else? 6) Does Solr keep old versions of documents or remove them? 2013/4/15 Furkan KAMACI furkankam...@gmail.com I have read the wiki and am reading the Lucidworks Solr Guide. However, I want to clear some things up in my mind. Here are my questions: 1) Does SolrCloud allow a multi-master design (is there any document I can read about it)? 2) Let's assume I use multiple cores, e.g. core A and core B, and that a document has just been indexed at core B. If I send a search request to core A, can I get a result? 3) When I use a multi-master design (if it exists), can I transfer one master's index data into another (with its slaves or not)? 4) When I use a multi-core design, can I transfer index data into another core or anywhere else? By the way, thanks for the quick responses and kindness on the mailing list.
Re: Some Questions About Using Solr as Cloud
Hi Jack; I see that SolrCloud automates everything. When I use SolrCloud, is it true that there may be more than one computer responsible for indexing at any given time? 2013/4/15 Jack Krupansky j...@basetechnology.com There are no masters or slaves in SolrCloud - it's fully distributed. Some cluster nodes will be leaders (of the shard on that node) at a given point in time, but different nodes may be leaders at different points in time as they become elected. In a distributed cluster you would never want to store documents only on one node. Sure, you can do that by setting the replication factor to 1, but that defeats half the purpose for SolrCloud. Index transfer is automatic - SolrCloud supports fully distributed update. You might be getting confused with the old Master-Slave-Replication model that Solr had (and still has) which is distinct from SolrCloud. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Sunday, April 14, 2013 7:45 PM To: solr-user@lucene.apache.org Subject: Some Questions About Using Solr as Cloud I have read the wiki and am reading the Lucidworks Solr Guide. However, I want to clear some things up in my mind. Here are my questions: 1) Does SolrCloud allow a multi-master design (is there any document I can read about it)? 2) Let's assume I use multiple cores, e.g. core A and core B, and that a document has just been indexed at core B. If I send a search request to core A, can I get a result? 3) When I use a multi-master design (if it exists), can I transfer one master's index data into another (with its slaves or not)? 4) When I use a multi-core design, can I transfer index data into another core or anywhere else? By the way, thanks for the quick responses and kindness on the mailing list.
SolrCloud Leaders
Is the number of leaders in a SolrCloud equal to the number of shards?
Re: SolrCloud Leaders
Can leaders respond to search requests (I mean, do they store indexes), both when I first start SolrCloud and later on? 2013/4/15 Jack Krupansky j...@basetechnology.com When the cluster is fully operational, yes. But if part of the cluster is down or split and unable to communicate, or leader election is in progress, the actual count of leaders will not be indicative of the number of shards. Leaders and shards are apples and oranges. If you take down a cluster, by definition it would have no leaders (because leaders are running code), but shards are the files in the index on disk that continue to exist even if the code is not running. So, in the extreme, the number of leaders can be zero while the number of shards is non-zero on disk. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Monday, April 15, 2013 8:21 AM To: solr-user@lucene.apache.org Subject: SolrCloud Leaders Is the number of leaders in a SolrCloud equal to the number of shards?
Re: SolrCloud Leaders
This page: https://support.lucidworks.com/entries/22180608-Solr-HA-DR-overview-3-x-and-4-0-SolrCloud-and says: "Both leaders and replicas index items and perform searches." How do replicas index items? 2013/4/15 Furkan KAMACI furkankam...@gmail.com Can leaders respond to search requests (I mean, do they store indexes), both when I first start SolrCloud and later on?
Usage of CloudSolrServer?
I am reading the Lucidworks Solr Guide, and it says in the SolrCloud section: *Read Side Fault Tolerance* With earlier versions of Solr, you had to set up your own load balancer. Now each individual node load balances requests across the replicas in a cluster. You still need a load balancer on the 'outside' that talks to the cluster, or you need a smart client. (Solr provides a smart Java Solrj client called CloudSolrServer.) My system is as follows: I crawl data with Nutch and send it into SolrCloud, and users will search at Solr. What is CloudSolrServer? Should I use it for load balancing, or is it something different?
Re: Usage of CloudSolrServer?
Hi Shawn; I am sorry, but what kind of load balancing is that? I mean, does it check whether some leaders are using more CPU or RAM, etc.? I think a problem may occur in this kind of scenario: if some leaders get more documents than other leaders (I don't know how it is decided which shard a document will go to), then won't there be a bottleneck on that leader? 2013/4/15 Shawn Heisey s...@elyograg.org On 4/15/2013 8:05 AM, Furkan KAMACI wrote: My system is as follows: I crawl data with Nutch and send it into SolrCloud, and users will search at Solr. What is CloudSolrServer? Should I use it for load balancing, or is it something different? It appears that the Solr integration in Nutch currently does not use CloudSolrServer. There is an issue to add it. The mutual dependency on HttpClient is holding it up - Nutch uses HttpClient 3, SolrJ 4.x uses HttpClient 4. https://issues.apache.org/jira/browse/NUTCH-1377 Until that is fixed, a load balancer would be required for full redundancy for updates with SolrCloud. You don't have to use a load balancer for it to work, but if the Solr server that Nutch is using goes down, then indexing will stop unless you reconfigure Nutch or bring the Solr server back up. Thanks, Shawn
Re: Storing Solr Index on NFS
Hi Walter; You said: It is not safe to share Solr index files between two Solr servers. Why is that? 2013/4/16 Tim Vaillancourt t...@elementspace.com If centralization of storage is your goal by choosing NFS, iSCSI works reasonably well with SOLR indexes, although good local storage will always be the overall winner. I noticed a near 5% degradation in overall search performance (casual testing, nothing scientific) when moving 40-50GB indexes to iSCSI (10GBe network) from a 4x7200rpm RAID 10 local SATA disk setup. Tim On 15/04/13 09:59 AM, Walter Underwood wrote: Solr 4.2 does have field compression which makes smaller indexes. That will reduce the amount of network traffic. That probably does not help much, because I think the latency of NFS is what causes problems. wunder On Apr 15, 2013, at 9:52 AM, Ali, Saqib wrote: Hello Walter, Thanks for the response. That has been my experience in the past as well. But I was wondering if there are new things in Solr 4 and NFS 4.1 that make the storing of indexes on an NFS mount feasible. Thanks, Saqib On Mon, Apr 15, 2013 at 9:47 AM, Walter Underwood wun...@wunderwood.org wrote: On Apr 15, 2013, at 9:40 AM, Ali, Saqib wrote: Greetings, Are there any issues with storing Solr indexes on an NFS share? Also any recommendations for using NFS for Solr indexes? I recommend that you do not put Solr indexes on NFS. It can be very slow; I measured indexing as 100X slower on NFS a few years ago. It is not safe to share Solr index files between two Solr servers, so there is no benefit to NFS. wunder -- Walter Underwood wun...@wunderwood.org
Re: Usage of CloudSolrServer?
Thanks for your detailed explanation. However, you said: It will then choose one of those hosts/cores for each shard, and send a request to them as a distributed search request. Is there any document that explains distributed search? What are the criteria for it? 2013/4/16 Upayavira u...@odoko.co.uk If you are accessing Solr from Java code, you will likely use the SolrJ client to do so. If your users are hitting Solr directly, you should think about whether this is wise - as well as providing them with direct search access, you are also providing them with the ability to delete your entire index with a single command. SolrJ isn't really a load balancer as such. When SolrJ is used to make a request against a collection, it will ask Zookeeper for the names of the shards that make up that collection, and for the hosts/cores that make up the set of replicas for those shards. It will then choose one of those hosts/cores for each shard, and send a request to them as a distributed search request. This has the advantage over traditional load balancing that if you bring up a new node, that node will register itself with ZooKeeper, and thus your SolrJ client(s) will know about it, without any intervention. Upayavira On Tue, Apr 16, 2013, at 08:36 AM, Furkan KAMACI wrote: Hi Shawn; I am sorry, but what kind of load balancing is that? I mean, does it check whether some leaders are using more CPU or RAM, etc.? I think a problem may occur in this kind of scenario: if some leaders get more documents than other leaders (I don't know how it is decided which shard a document will go to), then won't there be a bottleneck on that leader?
Re: Some Questions About Using Solr as Cloud
Hi Erick; Thanks for the explanation. You said: You cannot transfer just the indexed form of a document from one core to another, you have to re-index the doc. Why is that? 2013/4/16 Erick Erickson erickerick...@gmail.com Yes. Every node is really self-contained. When you send a doc to a cluster where each shard has a replica, the raw doc is sent to each node of that shard and indexed independently. About old docs, it's the same as Solr 3.6. Data associated with docs stays around in the index until it's merged away. You cannot transfer just the indexed form of a document from one core to another, you have to re-index the doc. Best Erick On Mon, Apr 15, 2013 at 7:46 AM, Furkan KAMACI furkankam...@gmail.com wrote: Hi Jack; I see that SolrCloud automates everything. When I use SolrCloud, is it true that there may be more than one computer responsible for indexing at any given time?
SolrCloud Leader Response Mechanism
When a leader responds to a query, does it say: "If I have the data that is being looked for, I should build the response with it; otherwise I should find it somewhere else" (because searching for it may take long)? Or does it say: "I only index the data; I will tell the other nodes to build up the query response"?
Same Shards at Different Machines
Is it possible to use the same shards on different machines in SolrCloud?
Re: SolrCloud Leader Response Mechanism
Hi Mark; To use proper terms, I want to ask: is there data locality, i.e. spatial locality ( http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_locality.html - I mean, if you have data on your machine, use it and don't look for it anywhere else; only search for the remaining parts) when querying a leader in SolrCloud? 2013/4/16 Mark Miller markrmil...@gmail.com Leaders don't have much to do with querying - the node that you query will determine what other nodes it has to query to search the whole index and do a scatter/gather for you. (Though in some cases that request can be proxied to another node) - Mark On Apr 16, 2013, at 7:48 AM, Furkan KAMACI furkankam...@gmail.com wrote: When a leader responds to a query, does it say: "If I have the data that is being looked for, I should build the response with it; otherwise I should find it somewhere else" (because searching for it may take long)? Or does it say: "I only index the data; I will tell the other nodes to build up the query response"?
Why is indexing and querying performance better in SolrCloud compared to older versions of Solr?
Is there any document that describes why indexing and querying performance is better in SolrCloud compared to older versions of Solr? I was considering this architecture: one cloud of Solr machines that just do indexing, and another cloud that copies those indexes and serves only queries, in order to get better performance. However, if I use SolrCloud, I think there is no need to build an architecture like that.
Re: Pointing to HBase for Documents or Directly Saving Documents at HBase
Hi Otis and Jack; I have done some research on highlighting and debugged the code. I see that highlights are query dependent and not stored. Why does Solr use Lucene for storing text, i.e. the content of a web page? Is there any comparison of storing text in HBase or other databases versus Lucene? Also, I would like to know: is there anybody on our solr-user list who has used anything other than Lucene to store document text? 2013/4/11 Otis Gospodnetic otis.gospodne...@gmail.com Source code is your best bet. Wiki has info about how to use it, but not how highlighting is implemented. But you don't need to understand the implementation details to understand that they are dynamic, computed specifically for each query for each matching document, so you cannot store them anywhere ahead of time. Otis -- Solr ElasticSearch Support http://sematext.com/
When a search query comes to a replica, what happens?
I want to make this clear in my mind: when a search query comes to a replica, what happens? - Does it forward the search query to the leader, and the leader collects all the data and prepares the response (this would cause a performance issue, because the leader is responsible for indexing at the same time)? Or - does the replica communicate with the leader and learn where the remaining data is (the leader asks ZooKeeper and tells the replica), and the replica collects all the data and returns the response?
How Does SolrCloud Balance the Number of Documents at Each Shard?
Is it possible for different shards to have different numbers of documents, or does SolrCloud balance them? I ask this question because I want to learn the mechanism behind how Solr calculates the hash value of a document's identifier. Is it possible that the hash function puts more documents into one shard than into the others? (Because this may cause a bottleneck at some leaders of a SolrCloud.)
Re: When a search query comes to a replica, what happens?
All in all, will a replica ask its leader where the remaining data is, or will it ask ZooKeeper directly? 2013/4/17 Otis Gospodnetic otis.gospodne...@gmail.com Hi, No, I believe redirect from replica to leader would happen only at index time, so a doc first gets indexed to leader and from there it's replicated to non-leader shards. At query time there is no redirect to leader, I imagine, as that would quickly turn leaders into hotspots. Otis -- Solr ElasticSearch Support http://sematext.com/ On Tue, Apr 16, 2013 at 6:01 PM, Furkan KAMACI furkankam...@gmail.com wrote: I want to make this clear in my mind: when a search query comes to a replica, what happens? - Does it forward the search query to the leader, and the leader collects all the data and prepares the response (this would cause a performance issue, because the leader is responsible for indexing at the same time)? Or - does the replica communicate with the leader and learn where the remaining data is (the leader asks ZooKeeper and tells the replica), and the replica collects all the data and returns the response?
Re: How Does SolrCloud Balance the Number of Documents at Each Shard?
Hi Otis; Firstly, thanks for your answers. So do you mean that the hashing mechanism routes a document to an effectively random shard? I ask because I am considering putting a load balancer in front of my SolrCloud and manually routing some documents to other shards to avoid a bottleneck. 2013/4/17 Otis Gospodnetic otis.gospodne...@gmail.com They won't be exact, but should be close. Are you seeing some *big* differences? Otis -- Solr ElasticSearch Support http://sematext.com/ On Tue, Apr 16, 2013 at 6:11 PM, Furkan KAMACI furkankam...@gmail.com wrote: Is it possible for different shards to have different numbers of documents, or does SolrCloud balance them? I ask this question because I want to learn the mechanism behind how Solr calculates the hash value of a document's identifier. Is it possible that the hash function puts more documents into one shard than into the others? (Because this may cause a bottleneck at some leaders of a SolrCloud.)
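The routing being discussed here is deterministic rather than random: SolrCloud's default router hashes the document's uniqueKey (MurmurHash3 in the real implementation) and assigns the document to the shard whose hash range contains that value, which keeps shards close to evenly loaded. A simplified self-contained sketch of the idea, using Java's built-in String.hashCode modulo the shard count instead of Solr's actual hash-range logic:

```java
import java.util.*;

public class HashRouting {
    // Simplified sketch: hash the document id and map it to a shard.
    // Solr really uses MurmurHash3 plus per-shard hash ranges; the
    // statistical effect (near-even spread) is the same.
    static int shardFor(String docId, int numShards) {
        return Math.floorMod(docId.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 4;
        int[] counts = new int[numShards];
        for (int i = 0; i < 100_000; i++) {
            counts[shardFor("doc-" + i, numShards)]++;
        }
        // With a reasonable hash, the shards end up close to evenly loaded.
        System.out.println(Arrays.toString(counts));
    }
}
```

Because the hash is deterministic, the same id always lands on the same shard; a load balancer in front of SolrCloud helps spread query traffic, but manually routing documents to balance shard sizes is normally unnecessary.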
Re: Pointing to HBase for Documents or Directly Saving Documents at HBase
Thanks again for your answer. If I find any document about such comparisons that I would like to read. By the way, is there any advantage for using Lucene instead of anything else as like that: Using Lucene is naturally supported at Solr and if I use anything else I may face with some compatibility problems or communicating issues? 2013/4/17 Otis Gospodnetic otis.gospodne...@gmail.com People do use other data stores to retrieve data sometimes. e.g. Mongo is popular for that. Like I hinted in another email, I wouldn't necessarily recommend this for common cases. Don't do it unless you really know you need it. Otherwise, just store in Solr. Otis -- Solr ElasticSearch Support http://sematext.com/ On Tue, Apr 16, 2013 at 5:32 PM, Furkan KAMACI furkankam...@gmail.com wrote: Hi Otis and Jack; I have made a research about highlights and debugged code. I see that highlight are query dependent and not stored. Why Solr uses Lucene for storing text, I mean i.e. content of a web page. Is there any comparison about to store texts at Hbase or any other databases versus Lucene. Also I want to learn that is there anybody who has used anything else from Lucene to store text of document at our solr user list? 2013/4/11 Otis Gospodnetic otis.gospodne...@gmail.com Source code is your best bet. Wiki has info about how to use it, but not how highlighting is implemented. But you don't need to understand the implementation details to understand that they are dynamic, computed specifically for each query for each matching document, so you cannot store them anywhere ahead of time. Otis -- Solr ElasticSearch Support http://sematext.com/ On Thu, Apr 11, 2013 at 11:22 AM, Furkan KAMACI furkankam...@gmail.com wrote: Hi Otis; It seems that I should read more about highlights. Is there any where that explains in detail how highlights are generated at Solr? 2013/4/11 Otis Gospodnetic otis.gospodne...@gmail.com Hi, You can't store highlights ahead of time because they are query dependent. 
You could store documents in HBase and use Solr just for indexing. Is that what you want to do? If so, a custom SearchComponent executed after QueryComponent could fetch data from an external store like HBase. I'm not sure if I'd recommend that. Otis -- Solr ElasticSearch Support http://sematext.com/

On Thu, Apr 11, 2013 at 10:01 AM, Furkan KAMACI furkankam...@gmail.com wrote:

Actually I don't plan to store documents in Solr. I want to store just the highlights (snippets) in HBase and retrieve them from HBase when needed. What do you think about separating the highlights from Solr and storing them in HBase with SolrCloud? By the way, if you can explain at which stage and how highlights are generated in Solr, that would be welcome.

2013/4/9 Otis Gospodnetic otis.gospodne...@gmail.com

You may also be interested in looking at things like solrbase (on Github). Otis -- Solr ElasticSearch Support http://sematext.com/

On Sat, Apr 6, 2013 at 6:01 PM, Furkan KAMACI furkankam...@gmail.com wrote:

Hi; First of all I should mention that I am new to Solr and doing research about it. What I am trying to do is crawl some websites with Nutch and then index them with Solr (Nutch 2.1, Solr/SolrCloud 4.2). I wonder about something. I have a cloud of machines that crawls websites and stores the documents. Then I send those documents to SolrCloud. Solr indexes the documents, generates indexes, and saves them. I know from Information Retrieval theory that it *may* not be efficient to store indexes in a NoSQL database (they are something like linked lists, and if you store them in such a database you *may* end up with a sparse representation - though there may be solutions for that; if you can explain them, you are welcome to). However, Solr stores some documents too (i.e. for highlights), so some of my documents will be duplicated somehow. Considering that I will have many documents, those duplicated documents may become a problem for me.
So, is there any way to avoid storing those documents in Solr and instead point to them in HBase (where I save my crawled documents)? Or, instead of pointing, to store them directly in HBase - and would that be efficient?
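One common pattern matching Otis's suggestion is to keep fields indexed but not stored in Solr, store only the document id, and fetch the full content (and anything like snippets) from HBase by that id at render time. A minimal, hypothetical schema.xml sketch - the field and type names here are illustrative, not from any particular setup:

```xml
<!-- Only the key is stored; Solr answers queries with ids,
     and the application fetches document bodies from HBase by id. -->
<field name="id"      type="string"       indexed="true" stored="true" required="true"/>
<field name="content" type="text_general" indexed="true" stored="false"/>
```

Note the trade-off: Solr's own highlighting needs the stored (or term-vector) text, so with stored="false" the snippets would have to be produced outside Solr.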
Re: SolrCloud Leader Response Mechanism
Hi Otis; You said: "It can just do it because it knows where things are." Does it learn that from ZooKeeper?

2013/4/17 Otis Gospodnetic otis.gospodne...@gmail.com

If a query comes to shard X on some node and this shard X is NOT a leader but HAS data, it will just execute the query. If it needs to query shards on other nodes, it will have the info about which shards to query and will just do that and aggregate the results. It doesn't have to ask the leader for permission, for info, etc. It can just do it because it knows where things are. Otis -- Solr ElasticSearch Support http://sematext.com/

On Tue, Apr 16, 2013 at 5:23 PM, Furkan KAMACI furkankam...@gmail.com wrote:

Hi Mark; To use proper terms, what I want to ask is: is there data locality or spatial locality ( http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_locality.html - I mean: if you have data on your machine, use it and don't look for it anywhere else, only search for the remaining parts) when querying a leader in SolrCloud?

2013/4/16 Mark Miller markrmil...@gmail.com

Leaders don't have much to do with querying - the node that you query will determine what other nodes it has to query to search the whole index and do a scatter/gather for you. (Though in some cases that request can be proxied to another node) - Mark

On Apr 16, 2013, at 7:48 AM, Furkan KAMACI furkankam...@gmail.com wrote:

When a leader responds to a query, does it say: "If I have the data being asked for, I should build the response with it; otherwise I should find it elsewhere, because searching for it could take long"? Or does it say: "I only index the data; I will tell the other guys to build up the query response"?
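Concretely: in Solr 4.x every node watches the cluster state that ZooKeeper publishes (clusterstate.json), so any node already holds the shard-to-node map and can route and aggregate a query on its own. A simplified, illustrative sketch of that state - host and core names are made up:

```json
{
  "collection1": {
    "shards": {
      "shard1": {
        "replicas": {
          "core_node1": {
            "base_url": "http://host1:8983/solr",
            "state": "active",
            "leader": "true"
          },
          "core_node2": {
            "base_url": "http://host2:8983/solr",
            "state": "active"
          }
        }
      }
    }
  }
}
```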
Re: SolrCloud Leader Response Mechanism
So the replica asks ZooKeeper, and the leader does not have to do anything. Thanks for your answer Otis.

2013/4/17 Otis Gospodnetic otis.gospodne...@gmail.com

Oui, ZK holds the map. Otis -- Solr ElasticSearch Support http://sematext.com/

On Tue, Apr 16, 2013 at 6:33 PM, Furkan KAMACI furkankam...@gmail.com wrote:

Hi Otis; You said: "It can just do it because it knows where things are." Does it learn that from ZooKeeper?
Re: Storing Solr Index on NFS
I don't want to be a bother, but I am trying to understand this part: "When you perform a commit in Solr you have (for an instant) two versions of the index. The commit produces new segments (with new documents, new deletions, etc.). After creating these new segments a new index searcher is created and its caches begin to autowarm. At this point the old index searcher that you were using is still active, receiving requests. After the new index searcher finishes loading and autowarming, the old searcher is discarded." So does that mean that when I have multiple Solr servers and a shared index, I would have to synchronize the caches in the RAM of those different machines?

2013/4/17 Otis Gospodnetic otis.gospodne...@gmail.com

Yesterday, we spent an hour with a client looking at their cluster's performance metrics in SPM, their indexing logs, etc., trying to figure out why some indexing was slower than it should have been. We traced the issues to network hiccups, to VMs that would move from host to host, etc. A really fancy and powerful system in terms of hardware resources, but in the end a bit too far from a simple locally attached HDD or SSD that would not have the issues we found. I'd stay away from NFS for the same reason - it's another moving part on the other side of the network. Otis -- Solr ElasticSearch Support http://sematext.com/

On Tue, Apr 16, 2013 at 7:15 AM, Furkan KAMACI furkankam...@gmail.com wrote:

Hi Walter; You said: "It is not safe to share Solr index files between two Solr servers." Why do you think that?

2013/4/16 Tim Vaillancourt t...@elementspace.com

If centralization of storage is your goal in choosing NFS, iSCSI works reasonably well with Solr indexes, although good local storage will always be the overall winner. I noticed nearly 5% degradation in overall search performance (casual testing, nothing scientific) when moving 40-50GB indexes to iSCSI (10GbE network) from a 4x7200rpm RAID 10 local SATA disk setup.
Tim

On 15/04/13 09:59 AM, Walter Underwood wrote:

Solr 4.2 does have stored-field compression, which makes smaller indexes. That will reduce the amount of network traffic. It probably does not help much, though, because I think the latency of NFS is what causes problems. wunder

On Apr 15, 2013, at 9:52 AM, Ali, Saqib wrote:

Hello Walter, Thanks for the response. That has been my experience in the past as well. But I was wondering if there are new things in Solr 4 and NFS 4.1 that make storing indexes on an NFS mount feasible. Thanks, Saqib

On Mon, Apr 15, 2013 at 9:47 AM, Walter Underwood wun...@wunderwood.org wrote:

On Apr 15, 2013, at 9:40 AM, Ali, Saqib wrote:

Greetings, Are there any issues with storing Solr indexes on an NFS share? Also, any recommendations for using NFS for Solr indexes?

I recommend that you do not put Solr indexes on NFS. It can be very slow; I measured indexing as 100X slower on NFS a few years ago. It is not safe to share Solr index files between two Solr servers, so there is no benefit to NFS. wunder -- Walter Underwood wun...@wunderwood.org
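The two-searchers-per-commit behaviour described in the thread is driven by solrconfig.xml. A minimal illustrative fragment - the sizes are examples only, not recommendations:

```xml
<!-- Caches belong to a searcher; a new searcher autowarms its caches
     from the old searcher's entries before it starts serving requests. -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>

<!-- Keep serving the old searcher until the new one is fully warmed. -->
<useColdSearcher>false</useColdSearcher>
```

Note that each Solr server warms its own in-heap caches independently; there is no cross-machine cache synchronization, even when the index files are shared.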
Re: Push/pull model between leader and replica in one shard
Really nice presentation.

2013/4/17 Mark Miller markrmil...@gmail.com

On Apr 16, 2013, at 1:36 AM, SuoNayi suonayi2...@163.com wrote:

Hi, can someone explain in more detail what model is used to sync docs between the leader and replicas in a shard? The model can be push or pull. Supposing I have only one shard with 1 leader and 2 replicas: when the leader receives an update request, will it scatter the request to each available and active replica first and then process the request locally at the end? In that case, if the replicas are able to keep up with the leader, can I think of this as a push model in which the leader pushes updates to its replicas?

Currently, the leader adds the doc locally and then sends it to all replicas concurrently.

What happens if a replica falls behind the leader? Will the replica pull docs from the leader and keep track of the incoming updates from the leader in a log (called the tlog)? If so, when it completes pulling docs, will it replay the updates in the tlog at the end?

If an update forwarded from a leader to a replica fails, it's likely because that replica died. Just in case, the leader will ask that replica to enter recovery. When a node comes up and is not a leader, it also enters recovery. Recovery tries to peersync from the leader, and if that fails (it works if the replica is off by about 100 updates), it replicates the entire index.

If you are interested in more details on the SolrCloud architecture, I've given a few talks on it - two of them here: http://vimeo.com/43913870 http://www.youtube.com/watch?v=eVK0wLkLw9w - Mark
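The push model Mark describes - index locally, then forward to all replicas concurrently and wait for their acknowledgements - can be sketched in plain Java. Everything here (Replica, Leader, the thread pool) is an illustrative stand-in, not Solr's actual distributed-update code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Stand-in types to illustrate the flow; these are NOT Solr's real internal classes.
interface Replica {
    void add(String doc);
}

class Leader {
    private final List<String> localIndex = new ArrayList<>();
    private final List<Replica> replicas;
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    Leader(List<Replica> replicas) {
        this.replicas = replicas;
    }

    void add(String doc) throws Exception {
        localIndex.add(doc);                      // 1. index the doc locally first
        List<Future<?>> acks = new ArrayList<>();
        for (Replica r : replicas) {              // 2. forward to all replicas concurrently
            acks.add(pool.submit(() -> r.add(doc)));
        }
        for (Future<?> ack : acks) {              // 3. wait for every replica to acknowledge;
            ack.get();                            //    a failure here would trigger recovery
        }
    }

    int indexedCount() {
        return localIndex.size();
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

The real implementation also handles versioning, the tlog, and the peersync/replication recovery path Mark mentions; none of that is shown in this toy.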
Re: Push/pull model between leader and replica in one shard
Hej Mark; What did you use to prepare your presentation? It's really nice.

2013/4/17 Furkan KAMACI furkankam...@gmail.com

Really nice presentation.

2013/4/17 Mark Miller markrmil...@gmail.com

If you are interested in more details on the SolrCloud architecture, I've given a few talks on it - two of them here: http://vimeo.com/43913870 http://www.youtube.com/watch?v=eVK0wLkLw9w - Mark
Solr Caching
I've just started to read about Solr caching, and I want to clarify one thing. Let's assume that I have 10 GB of RAM and have given 4 GB to my Solr application. When Solr's caching mechanism starts to work, does it use memory from that 4 GB part, or does it let the operating system cache things in the 6 GB of RAM remaining beyond the Solr application?
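For what it's worth, Solr's own caches are defined in solrconfig.xml and are allocated inside the JVM heap (the 4 GB in the example above), while the index files themselves are cached by the operating system's page cache in whatever RAM is left over - so both parts of the RAM end up being used, for different things. An illustrative solrconfig.xml fragment; the sizes are examples only:

```xml
<!-- All three caches live inside the JVM heap given to Solr. -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
<documentCache class="solr.LRUCache" size="512" initialSize="512"/>
```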