MLT in SolrJ vs. URL?
Hi, I compose an MLT query as a URL and, in my browser, get back the queried result along with a list of documents in the moreLikeThis section. When I execute the same query through SolrJ with the same parameters, I only get the queried result document back and no MLT docs. What's the trick here? Thanks, Darren
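A common cause of this symptom is that the SolrJ request is not carrying the exact MoreLikeThis parameters the browser URL did (mlt=true, mlt.fl, mlt.count), or is hitting a handler without the MLT component. A minimal sketch for comparing the two side by side: it assembles the parameter set as a query string in plain Java so it can be diffed against the URL that worked. The field name "text" and the count are assumptions, not taken from the thread; in SolrJ you would set the same keys via SolrQuery.set().

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MltParams {
    // Build the query-string form of a MoreLikeThis request so the SolrJ
    // parameters can be compared 1:1 with the URL that worked in the browser.
    static String buildQueryString(String q) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", q);
        params.put("mlt", "true");    // enable the MoreLikeThis component
        params.put("mlt.fl", "text"); // assumed similarity field
        params.put("mlt.count", "5"); // similar docs to return per result
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildQueryString("id:42"));
    }
}
```

If the assembled string matches the browser URL's parameters and MLT docs still do not come back, the difference is likely the request handler the SolrJ client is addressing.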
Re: zk Config URL?
(AbstractInhabitantImpl.java:78)
    at com.sun.enterprise.v3.server.AppServerStartup.run(AppServerStartup.java:253)
    at com.sun.enterprise.v3.server.AppServerStartup.doStart(AppServerStartup.java:145)
    at com.sun.enterprise.v3.server.AppServerStartup.start(AppServerStartup.java:136)
    at com.sun.enterprise.glassfish.bootstrap.GlassFishImpl.start(GlassFishImpl.java:79)
    at com.sun.enterprise.glassfish.bootstrap.GlassFishDecorator.start(GlassFishDecorator.java:63)
    at com.sun.enterprise.glassfish.bootstrap.osgi.OSGiGlassFishImpl.start(OSGiGlassFishImpl.java:69)
    at com.sun.enterprise.glassfish.bootstrap.GlassFishMain$Launcher.launch(GlassFishMain.java:117)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.sun.enterprise.glassfish.bootstrap.GlassFishMain.main(GlassFishMain.java:97)
    at com.sun.enterprise.glassfish.bootstrap.ASMain.main(ASMain.java:55)
Caused by: java.lang.ClassNotFoundException: javax.servlet.Filter
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at sun.misc.Launcher$ExtClassLoader.findClass(Launcher.java:229)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    ... 55 more

On 02/24/2013 08:32 PM, Mark Miller wrote:
You either have to specifically upload a config set or use one of the bootstrap sys props. Are you doing either?
- Mark

On Feb 24, 2013, at 8:15 PM, Darren Govoni dar...@ontrenet.com wrote:
Thanks Michael. I went ahead and just started an external zookeeper, but my solr node throws exceptions from it.
Caused by: org.apache.solr.common.cloud.ZooKeeperException: Could not find configName for collection collection1 found:null
...
[#|2013-02-24T20:13:58.451-0500|SEVERE|glassfish3.1.2|org.apache.solr.core.CoreContainer|_ThreadID=28;_ThreadName=Thread-2;|null:org.apache.solr.common.SolrException: Unable to create core: collection1
    at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:1654)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1039)
    at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
    at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.solr.common.cloud.ZooKeeperException: Could not find configName for collection collection1 found:null
    at org.apache.solr.cloud.ZkController.getConfName(ZkController.java:1097)
    at org.apache.solr.cloud.ZkController.createCollectionZkNode(ZkController.java:1016)
    at org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:937)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1031)
    ... 10 more

On 02/24/2013 07:21 PM, Michael Della Bitta wrote:
Hello Darren, If you go into the admin and click on Cloud, you'll see that information represented in a number of ways. Both Dump and Tree (especially the clusterstate.json file) have this information represented as a document in JSON format.
If you don't see the Cloud navigation on the left side of the admin screen, that's a good indication that Solr hasn't connected to Zookeeper.

Michael Della Bitta
Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271
www.appinions.com
Where Influence Isn’t a Game

On Sun, Feb 24, 2013 at 6:34 PM, Darren Govoni dar...@ontrenet.com wrote:
Hi, I'm trying the latest SolrCloud 4.1. Is there a button (or URL) I can't find that shows me the ZooKeeper config XML, so I can check which other nodes are connected? Can't seem to find it. I deploy my SolrCloud war into GlassFish and set jetty.port (among other properties) to the GF domain port (e.g. 8181). It starts successfully. I want ZooKeeper to run automatically within (as needed). How can I verify this, or refer to the first/master server using zkHost from another node (e.g. {host}:{port}) to form a cluster? I did this before a while ago, before Solr 4.x was released, but things have changed. Tips appreciated. Thank you. Darren
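Mark's two options for getting a config into ZooKeeper can be sketched as startup commands. These are illustrative only: the ZooKeeper host, paths, config name, and the stock Solr 4.x jetty-style layout are assumptions, not taken from the thread; deploying inside GlassFish would pass the same system properties to the domain's JVM instead.

```shell
# Option 1: bootstrap system properties on the first node. This pushes the
# local conf directory into ZooKeeper and links it to the collection.
java -DzkHost=zkhost:2181 \
     -Dbootstrap_confdir=./solr/collection1/conf \
     -Dcollection.configName=myconf \
     -jar start.jar

# Option 2: upload a config set explicitly with the zkcli tool shipped in
# Solr's cloud-scripts, then start every node with only -DzkHost.
./zkcli.sh -zkhost zkhost:2181 -cmd upconfig \
           -confdir ./solr/collection1/conf -confname myconf
java -DzkHost=zkhost:2181 -jar start.jar
```

Either way, the "Could not find configName for collection collection1" exception above indicates neither step had happened before the core was created.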
zk Config URL?
Hi, I'm trying the latest SolrCloud 4.1. Is there a button (or URL) I can't find that shows me the ZooKeeper config XML, so I can check which other nodes are connected? Can't seem to find it. I deploy my SolrCloud war into GlassFish and set jetty.port (among other properties) to the GF domain port (e.g. 8181). It starts successfully. I want ZooKeeper to run automatically within (as needed). How can I verify this, or refer to the first/master server using zkHost from another node (e.g. {host}:{port}) to form a cluster? I did this before a while ago, before Solr 4.x was released, but things have changed. Tips appreciated. Thank you. Darren
Re: zk Config URL?
Thanks Michael. I went ahead and just started an external zookeeper, but my solr node throws exceptions from it.

Caused by: org.apache.solr.common.cloud.ZooKeeperException: Could not find configName for collection collection1 found:null
...
[#|2013-02-24T20:13:58.451-0500|SEVERE|glassfish3.1.2|org.apache.solr.core.CoreContainer|_ThreadID=28;_ThreadName=Thread-2;|null:org.apache.solr.common.SolrException: Unable to create core: collection1
    at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:1654)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1039)
    at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
    at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.solr.common.cloud.ZooKeeperException: Could not find configName for collection collection1 found:null
    at org.apache.solr.cloud.ZkController.getConfName(ZkController.java:1097)
    at org.apache.solr.cloud.ZkController.createCollectionZkNode(ZkController.java:1016)
    at org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:937)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1031)
    ... 10 more

On 02/24/2013 07:21 PM, Michael Della Bitta wrote:
Hello Darren, If you go into the admin and click on Cloud, you'll see that information represented in a number of ways.
Both Dump and Tree (especially the clusterstate.json file) have this information represented as a document in JSON format. If you don't see the Cloud navigation on the left side of the admin screen, that's a good indication that Solr hasn't connected to Zookeeper.

Michael Della Bitta
Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271
www.appinions.com
Where Influence Isn’t a Game

On Sun, Feb 24, 2013 at 6:34 PM, Darren Govoni dar...@ontrenet.com wrote:
Hi, I'm trying the latest SolrCloud 4.1. Is there a button (or URL) I can't find that shows me the ZooKeeper config XML, so I can check which other nodes are connected? Can't seem to find it. I deploy my SolrCloud war into GlassFish and set jetty.port (among other properties) to the GF domain port (e.g. 8181). It starts successfully. I want ZooKeeper to run automatically within (as needed). How can I verify this, or refer to the first/master server using zkHost from another node (e.g. {host}:{port}) to form a cluster? I did this before a while ago, before Solr 4.x was released, but things have changed. Tips appreciated. Thank you. Darren
RE: SolrJ and Solr 4.0 | doc.getFieldValue() returns String instead of Date
SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.S'Z'");
Date dateObj = df.parse("2009-10-29T00:00:00.9Z");

--- Original Message --- On 1/8/2013 09:34 AM uwe72 wrote:
A Lucene 4.0 document now returns a String value for a Date field, instead of a Date object.

<field name="ModuleImpl.versionAsDate" view="Datenstand" type="date"/>

Solr 4.0 -- 2009-10-29T00:00:009Z
Solr 3.6 -- Date instance

Can this be set somewhere in the config? I prefer to receive a Date instance.

--
View this message in context: http://lucene.472066.n3.nabble.com/SolrJ-and-Solr-4-0-doc-getFieldValue-returns-String-instead-of-Date-tp4031588.html
Sent from the Solr - User mailing list archive at Nabble.com.
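The reply's one-liner, expanded into a self-contained sketch. The class name and sample value are illustrative; the pattern letters matter: yyyy for the year and HH for hours 0-23 (hh is the 12-hour clock and silently misparses midnight), and the time zone must be pinned to UTC since that is what Solr's date strings represent.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class SolrDateParse {
    // Parse a Solr-style date string back into a java.util.Date.
    public static Date parse(String s) throws ParseException {
        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.S'Z'");
        df.setTimeZone(TimeZone.getTimeZone("UTC")); // Solr dates are UTC
        return df.parse(s);
    }

    public static void main(String[] args) throws ParseException {
        Date d = parse("2009-10-29T00:00:00.0Z");
        System.out.println(d.getTime()); // epoch millis at UTC midnight
    }
}
```

SimpleDateFormat is not thread-safe, so in client code it should be created per call or per thread rather than shared.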
RE: RE: Max number of core in Solr multi-core
This should be clarified some. In the client API, SolrServer represents a connection to a single server backend/endpoint and should be re-used where possible. The approach being discussed is to have one client connection (represented by the SolrServer class) per Solr core, all residing in a single Solr server (as is the case below, but not required).

--- Original Message --- On 1/7/2013 08:06 AM Jay Parashar wrote:
This is the exact approach we use in our multithreaded env. One server per core. I think this is the recommended approach.

-----Original Message-----
From: Parvin Gasimzade [mailto:parvin.gasimz...@gmail.com]
Sent: Monday, January 07, 2013 7:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Max number of core in Solr multi-core

I know that, but my question is different. Let me ask it in this way.

I have a Solr with base URL localhost:8998/solr and two Solr cores at localhost:8998/solr/core1 and localhost:8998/solr/core2.

I have one base Solr instance initialized as:
SolrServer server = new HttpSolrServer(url);

I have also created SolrServers for each core:
SolrServer core1 = new HttpSolrServer(url + "/core1");
SolrServer core2 = new HttpSolrServer(url + "/core2");

Since there are many cores, I have to initialize a SolrServer as shown above. Is there a way to create only one SolrServer with the base URL and access each core using it? If it is possible, then I don't need to create a new SolrServer for each core.

On Mon, Jan 7, 2013 at 2:39 PM, Erick Erickson erickerick...@gmail.com wrote:
This might help: https://wiki.apache.org/solr/Solrj#HttpSolrServer
Note that the associated SolrRequest takes the path, I presume relative to the base URL you initialized the HttpSolrServer with.
Best, Erick

On Mon, Jan 7, 2013 at 7:02 AM, Parvin Gasimzade parvin.gasimz...@gmail.com wrote:
Thank you for your responses. I have one more question related to Solr multi-core. By using SolrJ I create a new core for each application. When a user wants to add data or make a query on his application, I create a new HttpSolrServer for this core. In this scenario there will be many running HttpSolrServer instances. Is there a better solution? Does it cause a problem to run many instances at the same time?

On Wed, Jan 2, 2013 at 5:35 PM, Per Steffensen st...@designware.dk wrote:
...using a collection per application instead of a core
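The advice above, one long-lived client per core rather than a new HttpSolrServer per request, can be sketched as a small cache keyed by core name. To keep the sketch dependency-free, the cached value here is just the per-core endpoint URL; in real SolrJ code the map values would be HttpSolrServer instances, which are thread-safe and meant to be created once and shared. The base URL and core names are the ones from the thread.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CoreClientCache {
    private final String baseUrl;
    // One cached entry per core; in SolrJ these would be HttpSolrServer
    // instances created once and reused across threads.
    private final Map<String, String> clients = new ConcurrentHashMap<>();

    public CoreClientCache(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    public String clientFor(String core) {
        // computeIfAbsent builds the endpoint once, then reuses it
        return clients.computeIfAbsent(core, c -> baseUrl + "/" + c);
    }

    public int size() {
        return clients.size();
    }

    public static void main(String[] args) {
        CoreClientCache cache = new CoreClientCache("http://localhost:8998/solr");
        String a = cache.clientFor("core1");
        String b = cache.clientFor("core1"); // second lookup hits the cache
        System.out.println(a.equals(b) && cache.size() == 1);
    }
}
```

This bounds the number of client objects at one per core no matter how many requests arrive, which is the point being made in the reply.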
Re: Terminology question: Core vs. Collection vs...
Yes. In that case, core should best be described as a logical Solr entity with various managed attributes and qualities above the physical layer (sorry, not trying to perpetuate this thread so much).

On 01/04/2013 01:55 PM, Mark Miller wrote:
Currently a SolrCore is 1:1 with a low-level Lucene index. There is no reason that needs to always be that way. It's possible that we may at some point add built-in micro sharding support that means a SolrCore could have multiple underlying Lucene indexes. Or we may not.
- Mark

On Jan 4, 2013, at 1:49 PM, darren dar...@ontrenet.com wrote:
Good point. Agree.
Sent from my Verizon Wireless 4G LTE Smartphone

-------- Original message --------
From: Upayavira u...@odoko.co.uk
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...
Using your terminology, I'd say core is a physical Solr term, and index is a physical Lucene term. A collection or a shard is a logical Solr term.
Upayavira

On Fri, Jan 4, 2013, at 06:28 PM, darren wrote:
My understanding is core is a logical Solr term. Index is a physical Lucene term. A Solr core is backed by a physical Lucene index. One index per core. Solr team can correct me if it's not accurate. :)

-------- Original message --------
From: Alexandre Rafalovitch arafa...@gmail.com
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...
Can I just start by saying that this was AMAZING. :-) When I asked the question, I certainly did not expect this level of detail. And I vote for the cake diagram for the WIKI as well. Perhaps two, with the first one showing the trivial collapsed state of a single collection/shard/replica/core. The trivial one will also help to explain why the example is now called 'collection1'. I think I followed everything, except for the just-added term 'index'. Isn't that the same as 'core'? Or can we have several indexes in one core?
Regards,
Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Fri, Jan 4, 2013 at 10:11 AM, darren dar...@ontrenet.com wrote:
This is the containment hierarchy I understand, but it includes both physical and logical.

-------- Original message --------
From: darren dar...@ontrenet.com
To: dar...@ontrenet.com, yo...@lucidworks.com, solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...
Actually: Node/collection/shard/replica/core/index

-------- Original message --------
From: darren dar...@ontrenet.com
To: yo...@lucidworks.com, solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...
Agreed. But for completeness can it be node/collection/shard/replica/core?
RE: Re: Terminology question: Core vs. Collection vs...
Good write up. And what about node?

I think there needs to be an official glossary of terms that is sanctioned by the Solr team, and some terms still in use may need to be labeled deprecated. After so many years, it's still confusing.

--- Original Message --- On 1/3/2013 08:07 AM Jack Krupansky wrote:
Collection is the more modern term and incorporates the fact that the collection may be sharded, with each shard on one or more cores, with each core being a replica of the other cores within that shard of that collection.

Instance is a general term, but is commonly used to refer to a running Solr server, each of which can service any number of cores. A sharded collection would typically require multiple instances of Solr, each with a shard of the collection.

Multiple collections can be supported on a single instance of Solr. They don't have to be sharded or replicated. But if they are, each Solr instance will have a copy or replica of the data (index) of one shard of each sharded collection - to the degree that each collection needs that many shards.

At the API level, you talk to a Solr instance, using a host and port, and giving the collection name. Some operations will refer only to the portion of a multi-shard collection on that Solr instance, but typically Solr will distribute the operation, whether it be an update or a query, to all of the shards of the named collection. In the case of an update, the update will be distributed to all replicas as well, but in the case of a query only one replica of each shard of the collection is needed.

Before SolrCloud, Solr had master and slave and the slaves were replicas of the master, but with SolrCloud there is no master and all the replicas of the shard are peers, although at any moment in time one of them will be considered the leader for coordination purposes, but not in the sense that it is a master of the other replicas in that shard. A SolrCloud replica is a replica of the data, in an abstract sense, for a single shard of a collection. A SolrCloud replica is more of an instance of the data/index.

An index exists at two levels: the portion of a collection on a single Solr core will have a Lucene index, but collectively the Lucene indexes for the shards of a collection can be referred to as the index of the collection. Each replica is a copy or instance of a portion of the collection's index.

The term slice is sometimes used to refer collectively to all of the cores/replicas of a single shard, or sometimes to a single replica as it contains only a slice of the full collection data.

-- Jack Krupansky

-----Original Message-----
From: Alexandre Rafalovitch
Sent: Thursday, January 03, 2013 4:42 AM
To: solr-user@lucene.apache.org
Subject: Terminology question: Core vs. Collection vs...

Hello,

I am trying to understand the core Solr terminology. I am looking for correct rather than loose meaning, as I am trying to teach an example that starts from an easy scenario and may scale to a multi-core, multi-machine situation.

Here are the terms that seem to be all overlapping and/or crossing over in my mind at the moment.

1) Index
2) Core
3) Collection
4) Instance
5) Replica (Replica of _what_?)
6) Others?

I tried looking through the documentation, but either there is terminology drift or I am having trouble understanding the distinctions.

If anybody has a clear picture in their mind, I would appreciate a clarification.

Regards,
Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
RE: Re: Terminology question: Core vs. Collection vs...
Thanks again. (And sorry to jump into this convo.) But I had a question on your statement. On 1/3/2013 08:07 AM Jack Krupansky wrote:
Collection is the more modern term and incorporates the fact that the collection may be sharded, with each shard on one or more cores, with each core being a replica of the other cores within that shard of that collection.

A collection is sharded, meaning it is distributed across cores. A shard itself is not distributed across cores in the same sense. Rather, a shard exists on a single core and is replicated on other cores. Is that right? The way it's worded above, it sounds like a shard can also be sharded...

--- Original Message --- On 1/3/2013 08:28 AM Jack Krupansky wrote:
A node is a machine in a cluster or cloud (graph). It could be a real machine or a virtualized machine. Technically, you could have multiple virtual nodes on the same physical box. Each Solr replica would be on a different node.

Technically, you could have multiple Solr instances running on a single hardware node, each with a different port. They are simply instances of Solr, although you could consider each Solr instance a node in a Solr cloud as well, a virtual node. So, technically, you could have multiple replicas on the same node, but that sort of defeats most of the purpose of having replicas in the first place - to distribute the data for performance and fault tolerance. But, you could have replicas of different shards on the same node/box for a partial improvement of performance and fault tolerance.

A Solr cloud is really a cluster.

-- Jack Krupansky

-----Original Message-----
From: Darren Govoni
Sent: Thursday, January 03, 2013 8:16 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Good write up. And what about node?
RE: Re: Terminology question: Core vs. Collection vs...
Thanks. I got that part. A group of shards (and therefore cores) represents a collection, yes. But does a single shard exist only on a single core?

--- Original Message --- On 1/3/2013 09:03 AM Jack Krupansky wrote:
No, a shard is a subset (or slice) of the collection. Sharding is a way of slicing the original data, before we talk about how the shards get stored and replicated on actual Solr cores. Replicas are instances of the data for a shard.

Sometimes people may loosely speak of a replica as being a shard, but that's just loose use of the terminology.

So, we're not sharding shards, but we are replicating shards.

-- Jack Krupansky
RE: Re: Terminology question: Core vs. Collection vs...
I think what's confusing about your explanation below is when you have a situation where there is no replication factor. That's possible too, yes? So in that case, is each core of a shard of a collection still referred to as a replica? To me a replica is a duplicate/backup of a shard's core, not the sharded core itself. Or is there just no difference, and even a non-replicated core is called a replica?

--- Original Message --- On 1/3/2013 09:08 AM Jack Krupansky wrote:
Oops... let me word that a little more carefully:

...we are replicating the data of each shard.

-- Jack Krupansky
Rather a shard brexist on a single core and is replicated on other cores. Is that right? The brway its worded above, it sounds like a shard can also be sharded... br br brbrbrbr--- Original Message --- brOn 1/3/2013 08:28 AM Jack Krupansky wrote:brA node is a machine in a brcluster or cloud (graph). It could be a real brbrmachine or a virtualized machine. Technically, you could have multiple brbrvirtual nodes on the same physical box. Each Solr replica would be on bra brbrdifferent node. brbr brbrTechnically, you could have multiple Solr instances running on a single brbrhardware node, each with a different port. They are simply instances of brbrSolr, although you could consider each Solr instance a node in a Solr brcloud brbras well, a virtual node. So, technically, you could have multiple brreplicas brbron the same node, but that sort of defeats most of the purpose of having brbrreplicas in the first place - to distribute the data for performance and brbrfault tolerance. But, you could have replicas of different shards on the brbrsame node/box for a partial improvement of performance and fault brtolerance. brbr brbrA Solr cloud' is really a cluster. brbr brbr-- Jack Krupansky brbr brbr-Original Message- brbrFrom: Darren Govoni brbrSent: Thursday, January 03, 2013 8:16 AM brbrTo: solr-user@lucene.apache.org brbrSubject: RE: Re: Terminology question: Core vs. Collection vs... brbr brbrGood write up. brbr brbrAnd what about node? brbr brbrI think there needs to be an official glossary of terms that is brsanctioned brbrby the solr team and some terms still ni use may need to be labeled brbrdeprecated. After so many years, its still confusing. 
brbr brbrbrbrbr--- Original Message --- brbrOn 1/3/2013 08:07 AM Jack Krupansky wrote:brCollection is the more brmodern brbrterm and incorporates the fact that the brbrbrcollection may be sharded, with each shard on one or more cores, brwith brbreach brbrbrcore being a replica of the other cores within that shard of that brbrbrcollection. brbrbr brbrbrInstance is a general term, but is commonly used to refer to a brrunning brbrSolr brbrbrserver, each of which can service any number of cores. A sharded brbrcollection brbrbrwould typically require multiple instances of Solr, each with a brshard of brbrthe brbrbrcollection. brbrbr brbrbrMultiple collections can be supported on a single instance of Solr. brThey brbrbrdon't have to be sharded or replicated. But if they are, each Solr brbrinstance brbrbrwill have a copy or replica of the data (index) of one shard of each brbrsharded brbrbrcollection - to the degree that each collection needs that many brshards. brbrbr brbrbrAt the API level, you talk to a Solr instance, using a host and brport, brbrand brbrbrgiving the collection name. Some operations will refer only to the brbrportion brbrbrof a multi-shard collection on that Solr instance, but typically brSolr brbrwill brbrbrdistribute the operation, whether it be an update
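The vocabulary being hashed out above (collection → shard → replica/core) can be sketched as a tiny model. This is illustrative Python, not Solr code, and every name in it is made up; the one real convention it encodes is that even a non-replicated shard's single core is still called a "replica".

```python
# Illustrative model (not Solr code): a collection is sliced into shards by
# hashing the document's unique key, and each shard's data is stored as one
# or more replica cores. With replicationFactor=1 the single core holding a
# shard is still a "replica" -- the term covers the non-replicated case too.

NUM_SHARDS = 2
REPLICATION_FACTOR = 2  # total copies of each shard, leader included

def shard_for(doc_id: str) -> int:
    """Route a document to a shard by hashing its unique key."""
    return hash(doc_id) % NUM_SHARDS

# Logical collection -> physical layout: each shard maps to its replica cores.
collection = {
    s: [f"shard{s}_replica{r + 1}" for r in range(REPLICATION_FACTOR)]
    for s in range(NUM_SHARDS)
}
total_cores = sum(len(replicas) for replicas in collection.values())
```

With two shards and a replication factor of 2, the collection occupies four cores in total, spread across however many nodes you have.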
RE: Re: Terminology question: Core vs. Collection vs...
Yes. And it's worth noting that when you have multiple shards in a single node (@deprecated), they are shards of different collections...

--- Original Message ---
On 1/3/2013 09:16 AM Jack Krupansky wrote:
> And I would revise node to note that in SolrCloud a node is simply an instance of a Solr server.
>
> And, technically, you can have multiple shards in a single instance of Solr, separating the logical sharding of keys from the distribution of the data.
>
> -- Jack Krupansky
> [...]
RE: Re: Terminology question: Core vs. Collection vs...
Ah, ok. Good. Makes sense. I think I will draw all this up in a UML diagram that includes the distinction between the logical terms and the physical terms (and their mapping), as they do get intertwined. I'll post it here when I'm done.

--- Original Message ---
On 1/3/2013 09:19 AM Jack Krupansky wrote:
> A single shard MAY exist on a single core, but only if it is not replicated. Generally, a single shard will exist on multiple cores, each a replica of the source data as it comes into the update handler.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Darren Govoni
> Sent: Thursday, January 03, 2013 9:10 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Re: Terminology question: Core vs. Collection vs...
>
> Thanks. I got that part.
>
> A group of shards (and therefore cores) represents a collection, yes. But does a single shard exist only on a single core?
> [...]
RE: Re: Terminology question: Core vs. Collection vs...
Great point.

--- Original Message ---
On 1/3/2013 10:42 AM Per Steffensen wrote:
> On 1/3/13 4:33 PM, Mark Miller wrote:
> > This has pretty much become the standard across other distributed systems and in the literat…err…books.
> Hmmm, I'm not sure you are right about that. Maybe more than one distributed system calls them Replica, but there are also a lot that don't. But if you are right, that's at least a good valid argument to do it this way, even though I generally prefer good logical naming over following bad naming from the industry :-) Just because there is a lot of crap out there doesn't mean that we also want to make crap. Maybe good logical naming could even be a small entry in the "Why Solr is better than its competitors" list :-)
RE: Re: Terminology question: Core vs. Collection vs...
And based on the previous explanation there is never a copy of a shard. A shard represents and contains only replicas for itself, replicas being copies of cores within the shard.

--- Original Message ---
On 1/3/2013 11:58 AM Walter Underwood wrote:
> A factor is multiplied, so multiplying the leader by a replicationFactor of 1 means you have exactly one copy of that shard.
>
> I think that recycling the term replication within Solr was confusing, but it is a bit late to change that.
>
> wunder
>
> On Jan 3, 2013, at 7:33 AM, Mark Miller wrote:
> > This has pretty much become the standard across other distributed systems and in the literat…err…books.
> >
> > I first implemented it as you mention you'd like, but Yonik correctly pointed out that we were going against the grain.
> >
> > - Mark
> >
> > On Jan 3, 2013, at 10:01 AM, Per Steffensen st...@designware.dk wrote:
> > > For the same reasons that Replica shouldn't be called Replica (it requires too long an explanation to agree that it is an ok name), replicationFactor shouldn't be called replicationFactor as long as it refers to the TOTAL number of cores you get for your shard. replicationFactor would be an ok name if replicationFactor=0 meant one core, replicationFactor=1 meant two cores, etc., but as long as replicationFactor=1 means one core and replicationFactor=2 means two cores, it is bad naming (you will not get any replication with replicationFactor=1 - WTF!?!?). If we want to insist that you specify the total number of cores, at least use replicaPerShard instead of replicationFactor, or even better rename Replica to Shard-instance and use instancesPerShard instead of replicationFactor.
> > >
> > > Regards, Per Steffensen
> > >
> > > On 1/3/13 3:52 PM, Per Steffensen wrote:
> > > Hi
> > >
> > > Here is my version - I do not believe the explanations have been very clear.
> > >
> > > We have the following concepts (here I will try to explain what each concept covers without naming it - it's hard):
> > > 1) Machines (virtual or physical) running Solr server JVMs (one machine can run several Solr server JVMs if you like)
> > > 2) Solr server JVMs
> > > 3) Logical stores where you can add/update/delete data-instances (closest to logical tables in an RDBMS)
> > > 4) Logical slices of a store (closest to non-overlapping logical sets of rows for the logical table in an RDBMS)
> > > 5) Physical instances of slices (a physical (disk/memory) instance of a logical slice). This is where data actually goes on disk - the logical stores and slices above are just non-physical concepts
> > >
> > > Terminology:
> > > 1) Believe we have no name for this (except of course "machine" :-) ), even though Jack claims that this is called a node. Maybe sometimes it is called a node, but I believe node is more often used to refer to a Solr server JVM.
> > > 2) Node
> > > 3) Collection
> > > 4) Shard. Used to be called Slice, but I believe now it is officially called Shard. I agree with that change, because I believe most of the industry also uses the term Shard for this logical/non-physical concept - it just needs to be reflected across documentation and code.
> > > 5) Replica. Used to be called Shard, but I believe now it is officially called Replica. I certainly do not agree with the name Replica, because it suggests that it is a copy of an original, but it isn't. I would prefer Shard-instance here, to avoid the confusion. I understand that you can argue (if you argue long enough) that Replica is a fine name, but you really need the explanation to understand why Replica can be defended as the name for this. It is not immediately obvious what this is as long as it is called Replica. A Replica is basically a SolrCloud-managed Core (so Replica=Core), and behind every Replica/Core lives a physical Lucene index. The term Replica also needs to be reflected across documentation and code.
> > >
> > > Regards, Per Steffensen
>
> --
> Walter Underwood
> wun...@wunderwood.org
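Walter's "a factor is multiplied" point above can be made concrete with one line of arithmetic. This is just a sketch of the convention the thread describes: replicationFactor counts ALL copies of a shard, leader included, so replicationFactor=1 means no redundancy.

```python
# Total cores a collection occupies under Solr's replicationFactor naming:
# replicationFactor is the TOTAL number of copies of each shard (leader
# included), so cores = shards x replicationFactor.
def total_cores(num_shards: int, replication_factor: int) -> int:
    return num_shards * replication_factor

print(total_cores(3, 1))  # 3 cores: one per shard, no extra copies
print(total_cores(3, 2))  # 6 cores: one leader + one replica per shard
```

This is exactly the naming complaint in the thread: with replicationFactor=1 you get zero replication, just the three leader cores.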
Re: Terminology question: Core vs. Collection vs...
I see. So sharding and distributing/replicating can have separate and different advantages.

On 01/03/2013 01:06 PM, Lance Norskog wrote:
> Also, searching can be much faster if you put all of the shards on one machine, along with the search distributor. That way, you search with multiple simultaneous threads inside one machine. I've seen this make searches several times faster.
>
> On 01/03/2013 06:36 AM, Jack Krupansky wrote:
> > Ah... the multiple shards (of the same collection) in a single node is about planning for future expansion of your cluster - create more shards than you need today, put more of them on a single node, and then migrate them to their own nodes as the data outgrows the smaller number of nodes. In other words, add nodes incrementally without having to reindex all the data.
> >
> > -- Jack Krupansky
> >
> > -----Original Message-----
> > From: Darren Govoni
> > Sent: Thursday, January 03, 2013 9:18 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Re: Terminology question: Core vs. Collection vs...
> >
> > Yes. And it's worth noting that when you have multiple shards in a single node (@deprecated), they are shards of different collections...
> > [...]
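Jack's "plan for expansion" idea above (create more shards than you need today, migrate them to their own nodes later) can be sketched as a round-robin layout. Node and shard names here are hypothetical; the point is that adding nodes just moves whole shards, with no reindexing.

```python
# Sketch of over-sharding for future growth: more shards than nodes,
# assigned round-robin. When nodes are added, whole shards migrate to
# them -- the shard count (and thus the document routing) never changes.
def assign(num_shards, nodes):
    layout = {n: [] for n in nodes}
    for s in range(num_shards):
        layout[nodes[s % len(nodes)]].append(f"shard{s + 1}")
    return layout

today = assign(8, ["nodeA", "nodeB"])                    # 4 shards per node
later = assign(8, ["nodeA", "nodeB", "nodeC", "nodeD"])  # 2 shards per node
```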
RE: Does SolrCloud supports MoreLikeThis?
There is a ticket for that with some recent activity (sorry I don't have it handy right now), but I'm not sure if that work made it into the trunk, so probably SolrCloud does not support MLT... yet. Would love an update from the dev team though!

--- Original Message ---
On 11/5/2012 10:37 AM Luis Cappa Banda wrote:
> That's the question, :-)
>
> Regards,
>
> Luis Cappa.
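For what it's worth, a (non-distributed) MoreLikeThis request reduces to a handful of standard query parameters, which also have to be set explicitly when going through SolrJ rather than a raw URL. A sketch of that parameter set, with made-up field names ("title", "body"):

```python
from urllib.parse import urlencode

# The standard parameters behind a MoreLikeThis query. In SolrJ the same
# params must be set explicitly on the query object; once they are, the
# moreLikeThis section appears in the response just as it does in a browser.
params = {
    "q": "id:doc1",
    "mlt": "true",           # enable the MoreLikeThis component
    "mlt.fl": "title,body",  # fields to mine for "interesting" terms
    "mlt.mindf": 1,          # min doc frequency for a term to qualify
    "mlt.mintf": 1,          # min term frequency in the source document
}
query_string = urlencode(params)
```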
Re: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and Realtime NRT available for download
It certainly seems to be a rogue project, but I can't understand the meaning of "realtime near realtime (NRT)" either. At best, it's oxymoronic.

On 10/29/2012 10:30 AM, Jack Krupansky wrote: Could any of the committers here confirm whether this is a legitimate effort? I mean, how could anything labeled "Apache ABC with XYZ" be an external project and be sanctioned/licensed by Apache? In fact, the linked web page doesn't even acknowledge the ownership of the Apache trademarks or ASL. And the term "Realtime NRT" is nonsensical. Even worse: "Realtime NRT makes available a near realtime view." Equally nonsensical. Who knows, maybe it is legit, but it sure comes across as a scam/spam. -- Jack Krupansky -Original Message- From: Nagendra Nagarajayya Sent: Monday, October 29, 2012 10:06 AM To: solr-user@lucene.apache.org Subject: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and Realtime NRT available for download Hi! I am very excited to announce the availability of Apache Solr 4.0 with RankingAlgorithm 1.4.4 and Realtime NRT. Realtime NRT is a high performance and more granular NRT implementation as to soft commit. The update performance is about 70,000 documents / sec* (almost 1.5-2x performance improvement over soft-commit). You can also scale up to 2 billion documents* in a single core, and query half a billion documents index in ms**. Realtime NRT is different from realtime-get. realtime-get does not have search capability and is a lookup by id. Realtime NRT allows full search, see here http://solr-ra.tgels.org/realtime-nrt.jsp for more info. Realtime NRT has been contributed back to Solr, see JIRA: https://issues.apache.org/jira/browse/SOLR-3816 RankingAlgorithm 1.4.4 supports the entire Lucene Query Syntax, ± and/or boolean/dismax/boost queries and is compatible with the new Lucene 4.0 api.
You can get more information about Solr 4.0 with RankingAlgorithm 1.4.4 and Realtime NRT performance from here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x You can download Solr 4.0 with RankingAlgorithm 1.4.4 from here: http://solr-ra.tgels.org Please download and give the new version a try. Note: 1. Apache Solr 4.0 with RankingAlgorithm 1.4.4 is an external project Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org * performance is a real use case of Apache Solr with RankingAlgorithm as seen at a user installation ** performance seen when using the age feature
Re: Cloud terminology clarification
I agree it needs updating, and I've always gotten confused at some point by the use (misuse) of terms. For example, the term 'node' is thrown around a lot too. What is it??! Hehe.

On Sat, 2012-09-08 at 22:26 -0700, JesseBuesking wrote:
> It's been a while since the terminology at http://wiki.apache.org/solr/SolrTerminology has been updated, so I'm wondering how these terms apply to solr cloud setups. My take on what the terms mean:
>
> Collection: Basically the highest-level container that bundles together the other pieces for servicing a particular search setup
> Core: An individual solr instance (represents entire indexes)
> Shard: A portion of a core (represents a subset of an index)
>
> Therefore:
> - increasing the number of shards allows for indexing more documents (aka scaling the amount of data that can be indexed)
> - increasing the number of cores increases the potential throughput of requests (aka cores mirror each other, allowing you to distribute requests to multiple servers)
>
> Does this sound right? If so, then my follow-up question would be: does the following directory structure look right/standard?
>
> .../solr # = solr home
> .../solr/collection-01
> .../solr/collection-01/core-01
> .../solr/collection-01/core-02
>
> And if this is right, I'm on a roll :D My next question would then be: given we're using zookeeper (separate machine), do we need 1 conf folder at collection-01's level? Or do we need 1 conf folder per core?
Re: Map/Reduce directly against solr4 index.
Of course you can do it, but the question is whether this will produce the performance results you expect. I've seen talk about this in other forums, so you might find some prior work here. Solr and HDFS serve somewhat different purposes. The key issue would be if your map and reduce code overloads the Solr endpoint. Even using SolrCloud, I believe all requests will have to go through a single URL (to be routed), so if you have thousands of map/reduce jobs all running simultaneously, the question is whether your Solr is architected to handle that amount of throughput. On Thu, 2012-07-26 at 14:55 -0700, Trung Pham wrote: Is it possible to run map reduce jobs directly on Solr4? I'm asking this because I want to use Solr4 as the primary storage engine. And I want to be able to run near real time analytics against it as well. Rather than export solr4 data out to a hadoop cluster.
Re: Map/Reduce directly against solr4 index.
You raise an interesting possibility. A map/reduce solr handler over solrcloud...

On Thu, 2012-07-26 at 18:52 -0700, Trung Pham wrote:
> I think the performance should be close to Hadoop running on HDFS, if somehow a Hadoop job can directly read the Solr index file while executing the job on the local solr node. Kinda like how HBase and Cassandra integrate with Hadoop.
>
> Plus, we can run the map reduce job on a standby Solr4 cluster. This way, the documents in Solr will be our primary source of truth. And we have the ability to run near real time search queries and analytics on it. No need to export data around. Solr4 is becoming a very interesting solution to many web scale problems. Just missing the map/reduce component. :)
>
> On Thu, Jul 26, 2012 at 3:01 PM, Darren Govoni dar...@ontrenet.com wrote:
> [...]
Re: [Announce] Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 with Realtime NRT available for download
What exactly is Realtime NRT (Near Real Time)? On Sun, 2012-07-22 at 14:07 -0700, Nagendra Nagarajayya wrote: Hi! I am very excited to announce the availability of Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 with Realtime NRT. The Realtime NRT implementation now supports both RankingAlgorithm and Lucene. Realtime NRT is a high performance and more granular NRT implementation as to soft commit. The update performance is about 70,000 documents / sec*. You can also scale up to 2 billion documents* in a single core, and query half a billion documents index in ms**. RankingAlgorithm 1.4.4 supports the entire Lucene Query Syntax, ± and/or boolean queries and is compatible with the new Lucene 4.0-ALPHA api. You can get more information about Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 Realtime performance from here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x You can download Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 from here: http://solr-ra.tgels.org Please download and give the new version a try. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org * performance seen at a user installation of Solr 4.0 with RankingAlgorithm 1.4.3 ** performance seen when using the age feature
Re: Facet on all the dynamic fields with *_s feature
You'll have to query the index for the fields and sift out the _s ones, and cache them or something.

On Mon, 2012-07-16 at 16:52 +0530, Rajani Maski wrote:
> Yes, this feature would solve the below problem very neatly. All, is there any approach to achieve this for now? --Rajani
>
> On Sun, Jul 15, 2012 at 6:02 PM, Jack Krupansky j...@basetechnology.com wrote:
> > The answer appears to be No, but it's good to hear people express an interest in proposed features.
> >
> > -- Jack Krupansky
> >
> > -----Original Message-----
> > From: Rajani Maski
> > Sent: Sunday, July 15, 2012 12:02 AM
> > To: solr-user@lucene.apache.org
> > Subject: Facet on all the dynamic fields with *_s feature
> >
> > Hi All, is this issue fixed in solr 3.6 or 4.0: faceting on all dynamic fields with facet.field=*_s. Link: https://issues.apache.org/jira/browse/SOLR-247
> > If it is not fixed, any suggestion on how do I achieve this? My requirement is just the same as this one: http://lucene.472066.n3.nabble.com/Dynamic-facet-field-tc2979407.html#none
> >
> > Regards, Rajani
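The workaround above can be sketched as: fetch the live field names (e.g. from the Luke handler), keep the *_s ones, and facet on each. A minimal sketch with a made-up field list:

```python
# Workaround sketch: until facet.field=*_s is supported, select the "*_s"
# dynamic fields from a field-name list (as could be obtained from Solr's
# Luke handler) and build one facet.field parameter per match.
def string_facet_fields(field_names):
    return sorted(f for f in field_names if f.endswith("_s"))

fields = ["id", "price_f", "color_s", "brand_s", "title"]
facet_params = [("facet.field", f) for f in string_facet_fields(fields)]
```

Caching the resulting list, as suggested, avoids re-querying the field set on every request.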
Re: Solr Faceting
I don't think it comes at any added cost for solr to return that facet so you can filter it out in your business logic. On Sat, 2012-07-07 at 15:18 +0530, Shanu Jha wrote: Hi, I am generating facet for a field which has one of the value NA and I want solr should not create facet(or ignore) for this(NA) value. Is there any way to in solr to do that. Thanks
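Filtering the unwanted value in business logic, as suggested, is a couple of lines over the facet counts. This sketch assumes the flat [value, count, value, count, ...] list Solr returns for a facet field by default (the data here is illustrative):

```python
# Client-side filtering sketch: drop the "NA" bucket from a facet result
# before display, leaving the rest of the buckets intact.
def drop_facet_value(facet_counts, unwanted="NA"):
    pairs = zip(facet_counts[::2], facet_counts[1::2])
    return [(value, count) for value, count in pairs if value != unwanted]

facet_counts = ["red", 10, "NA", 7, "blue", 3]
filtered = drop_facet_value(facet_counts)  # [("red", 10), ("blue", 3)]
```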
Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support
I don't recall anyone being able to get acceptable performance with a single index that large with solr/lucene. The conventional wisdom is that parallel searching across cores (or shards in SolrCloud) is the best way to handle index sizes in the billions. So it's of great interest how you did. Anyone else gotten an index(es) with billions of documents to perform well? I'm greatly interested in how.

On Mon, 2012-05-28 at 05:12 -0700, Nagendra Nagarajayya wrote: It is a single node. I am trying to find out if the performance can be referenced. Regarding information on Solr with RankingAlgorithm, you can find all the information here: http://solr-ra.tgels.org On RankingAlgorithm: http://rankingalgorithm.tgels.org Regards, - NN

On 5/27/2012 4:50 PM, Li Li wrote: yes, I am also interested in good performance with 2 billion docs. how many search nodes do you use? what's the average response time and qps? another question: where can I find related paper or resources of your algorithm which explains the algorithm in detail? why is it better than google site (better than lucene is not very interesting, because lucene is not originally designed to provide search function like google)?

On Mon, May 28, 2012 at 1:06 AM, Darren Govoni dar...@ontrenet.com wrote: I think people on this list would be more interested in your approach to scaling 2 billion documents than modifying solr/lucene scoring (which is already top notch). So given that, can you share any references or otherwise substantiate good performance with 2 billion documents? Thanks.

On Sun, 2012-05-27 at 08:29 -0700, Nagendra Nagarajayya wrote: Actually, RankingAlgorithm 1.4.2 has been scaled to more than 2 billion docs. With RankingAlgorithm 1.4.3, using the parameters age=latestdocs=number feature, you can retrieve the NRT inserted documents in milliseconds from such a huge index, improving query and faceting performance and using very little resources ...
Currently, RankingAlgorithm 1.4.3 is only available with Solr 4.0, and the NRT insert performance with Solr 4.0 is about 70,000 docs / sec. RankingAlgorithm 1.4.3 should become available with Solr 3.6 soon. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org On 5/27/2012 7:32 AM, Darren Govoni wrote: Hi, Have you tested this with a billion documents? Darren On Sun, 2012-05-27 at 07:24 -0700, Nagendra Nagarajayya wrote: Hi! I am very excited to announce the availability of Solr 3.6 with RankingAlgorithm 1.4.2. This NRT support now works with both RankingAlgorithm and Lucene. The insert/update performance should be about 5000 docs in about 490 ms with the MbArtists index. RankingAlgorithm 1.4.2 has multiple algorithms, improved performance over the earlier releases, supports the entire Lucene Query Syntax, ± and/or boolean queries, and can scale to more than a billion documents. You can get more information about NRT performance here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x You can download Solr 3.6 with RankingAlgorithm 1.4.2 here: http://solr-ra.tgels.org Please download and give the new version a try. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org ps. The MbArtists index is the example index used in the Solr 1.4 Enterprise book
Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support
Hi, Have you tested this with a billion documents? Darren On Sun, 2012-05-27 at 07:24 -0700, Nagendra Nagarajayya wrote: Hi! I am very excited to announce the availability of Solr 3.6 with RankingAlgorithm 1.4.2. This NRT support now works with both RankingAlgorithm and Lucene. The insert/update performance should be about 5000 docs in about 490 ms with the MbArtists index. RankingAlgorithm 1.4.2 has multiple algorithms, improved performance over the earlier releases, supports the entire Lucene Query Syntax, ± and/or boolean queries, and can scale to more than a billion documents. You can get more information about NRT performance here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x You can download Solr 3.6 with RankingAlgorithm 1.4.2 here: http://solr-ra.tgels.org Please download and give the new version a try. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org ps. The MbArtists index is the example index used in the Solr 1.4 Enterprise book
Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support
I think people on this list would be more interested in your approach to scaling 2 billion documents than modifying solr/lucene scoring (which is already top notch). So given that, can you share any references or otherwise substantiate good performance with 2 billion documents? Thanks. On Sun, 2012-05-27 at 08:29 -0700, Nagendra Nagarajayya wrote: Actually, RankingAlgorithm 1.4.2 has been scaled to more than 2 billion docs. With RankingAlgorithm 1.4.3, using the age=latest&docs=number parameters, you can retrieve the NRT inserted documents in milliseconds from such a huge index, improving query and faceting performance and using very little resources ... Currently, RankingAlgorithm 1.4.3 is only available with Solr 4.0, and the NRT insert performance with Solr 4.0 is about 70,000 docs / sec. RankingAlgorithm 1.4.3 should become available with Solr 3.6 soon. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org On 5/27/2012 7:32 AM, Darren Govoni wrote: Hi, Have you tested this with a billion documents? Darren On Sun, 2012-05-27 at 07:24 -0700, Nagendra Nagarajayya wrote: Hi! I am very excited to announce the availability of Solr 3.6 with RankingAlgorithm 1.4.2. This NRT support now works with both RankingAlgorithm and Lucene. The insert/update performance should be about 5000 docs in about 490 ms with the MbArtists index. RankingAlgorithm 1.4.2 has multiple algorithms, improved performance over the earlier releases, supports the entire Lucene Query Syntax, ± and/or boolean queries, and can scale to more than a billion documents. You can get more information about NRT performance here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x You can download Solr 3.6 with RankingAlgorithm 1.4.2 here: http://solr-ra.tgels.org Please download and give the new version a try. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org ps. The MbArtists index is the example index used in the Solr 1.4 Enterprise book
SolrCloud war context name?
Hi, I am running my solrcloud nodes in an app server deployed into the context path 'solr' and zookeeper sees all of them. I want to deploy a second solrcloud war into the same app server (thus same IP:port) in a different context like 'solrrep' with the same config (cloned). Will this work? Or does zookeeper (or solrcloud leader) require all connected solr shards to have context url with ip:port/solr? Or will the correct URL be registered from the replica shard? thanks!
Re: SolrCloud war context name?
It's not really clear from the wiki how to use cores as shard replicas within the same solr server. In my mind, having a separate JVM/solr node acting as a replica makes sense because the replication traffic will be on a different channel in a different vm and won't interfere with search/indexing traffic on the primary shards. Or am I missing something easy about using cores with solr cloud? It was mentioned on the list recently that managing cores with solrcloud isn't really the best practice for it. On Sat, 2012-05-26 at 16:12 -0300, Marcelo Carvalho Fernandes wrote: Why not use multicore? Marcelo Carvalho Fernandes +55 21 8272-7970 On Sat, May 26, 2012 at 12:56 PM, Darren Govoni ontre...@ontrenet.com wrote: Hi, I am running my solrcloud nodes in an app server deployed into the context path 'solr' and zookeeper sees all of them. I want to deploy a second solrcloud war into the same app server (thus same IP:port) in a different context like 'solrrep' with the same config (cloned). Will this work? Or does zookeeper (or the solrcloud leader) require all connected solr shards to have a context url of ip:port/solr? Or will the correct URL be registered from the replica shard? thanks!
RE: Re: SolrCloud: how to index documents into a specific core and how to search against that core?
I'm curious what the solrcloud experts say, but my suggestion is to try not to over-engineer the search architecture on solrcloud. For example, what is the benefit of managing which cores are indexed and searched? Having to know those details, in my mind, works against the automation in solrcloud, but maybe there's a good reason you want to do it this way. --- Original Message --- On 5/22/2012 07:35 AM Yandong Yao wrote: Hi Darren, Thanks very much for your reply. The reason I want to control core indexing/searching is that I want to use one core to store one customer's data (all customers share the same config): e.g. customer 1 uses coreForCustomer1 and customer 2 uses coreForCustomer2. Is there any better way than using a different core for each customer? Another way may be to use a different collection for each customer, though I'm not sure how many collections solr cloud can support. Which way is better in terms of flexibility/scalability? (Suppose there are tens of thousands of customers.) Regards, Yandong 2012/5/22 Darren Govoni dar...@ontrenet.com Why do you want to control what gets indexed into a core and then know which core to search? That's the kind of knowing that SolrCloud solves. In SolrCloud, it handles the distribution of documents across shards and retrieves them regardless of which node is searched from. That is the point of cloud: you don't know the details of where exactly documents are being managed (i.e. they are cloudy). It can change and re-balance from time to time. SolrCloud performs the distributed search for you, therefore when you try to search a node/core with no documents, all the results from the cloud are retrieved regardless. This is considered A Good Thing. It requires a change in thinking about indexing and searching. On Tue, 2012-05-22 at 08:43 +0800, Yandong Yao wrote: Hi Guys, I use the following commands to start solr cloud according to the solr cloud wiki: yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar Then I have created several cores using the CoreAdmin API (http://localhost:8983/solr/admin/cores?action=CREATE&name=coreName&collection=collection1), and clusterstate.json shows the following topology: collection1: -- shard1: -- collection1 -- CoreForCustomer1 -- CoreForCustomer3 -- CoreForCustomer5 -- shard2: -- collection1 -- CoreForCustomer2 -- CoreForCustomer4 1) Index: Using the following command to index the mem.xml file in the exampledocs directory: yydzero:exampledocs bjcoe$ java -Durl=http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml SimplePostTool: version 1.4 SimplePostTool: POSTing files to http://localhost:8983/solr/coreForCustomer3/update.. SimplePostTool: POSTing file mem.xml SimplePostTool: COMMITting Solr index changes. And now the SolrAdmin UI shows that 'coreForCustomer1', 'coreForCustomer3', and 'coreForCustomer5' have 3 documents (mem.xml has 3 documents) and the other 2 cores have 0 documents. *Question 1:* Is this expected behavior? How do I index documents into a specific core? *Question 2:* If SolrCloud doesn't support this yet, how could I extend it to support this feature (index a document to a particular core)? Where should I start, the hashing algorithm? *Question 3:* Why are the documents also indexed into 'coreForCustomer1' and 'coreForCustomer5'? The default replica for documents is 1, right? Then I try to index some documents to 'coreForCustomer2': $ java -Durl=http://localhost:8983/solr/coreForCustomer2/update -jar post.jar ipod_video.xml While 'coreForCustomer2' still has 0 documents, and the documents in ipod_video are indexed to the cores for customers 1/3/5. *Question 4:* Why does this happen? 2) Search: I use http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml to search against 'CoreForCustomer2', while it will return all documents in the whole collection even though this core has no documents at all. Then I use http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml&shards=localhost:8983/solr/coreForCustomer2, and it will return 0 documents. *Question 5:* So if I want to search against a particular core, I need to use the 'shards' parameter with the solr core name as the parameter value, right? Thanks very much in advance! Regards, Yandong
Re: SolrCloud: how to index documents into a specific core and how to search against that core?
Why do you want to control what gets indexed into a core and then know which core to search? That's the kind of knowing that SolrCloud solves. In SolrCloud, it handles the distribution of documents across shards and retrieves them regardless of which node is searched from. That is the point of cloud: you don't know the details of where exactly documents are being managed (i.e. they are cloudy). It can change and re-balance from time to time. SolrCloud performs the distributed search for you, therefore when you try to search a node/core with no documents, all the results from the cloud are retrieved regardless. This is considered A Good Thing. It requires a change in thinking about indexing and searching. On Tue, 2012-05-22 at 08:43 +0800, Yandong Yao wrote: Hi Guys, I use the following commands to start solr cloud according to the solr cloud wiki: yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar Then I have created several cores using the CoreAdmin API (http://localhost:8983/solr/admin/cores?action=CREATE&name=coreName&collection=collection1), and clusterstate.json shows the following topology: collection1: -- shard1: -- collection1 -- CoreForCustomer1 -- CoreForCustomer3 -- CoreForCustomer5 -- shard2: -- collection1 -- CoreForCustomer2 -- CoreForCustomer4 1) Index: Using the following command to index the mem.xml file in the exampledocs directory: yydzero:exampledocs bjcoe$ java -Durl=http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml SimplePostTool: version 1.4 SimplePostTool: POSTing files to http://localhost:8983/solr/coreForCustomer3/update.. SimplePostTool: POSTing file mem.xml SimplePostTool: COMMITting Solr index changes. And now the SolrAdmin UI shows that 'coreForCustomer1', 'coreForCustomer3', and 'coreForCustomer5' have 3 documents (mem.xml has 3 documents) and the other 2 cores have 0 documents. *Question 1:* Is this expected behavior? How do I index documents into a specific core? *Question 2:* If SolrCloud doesn't support this yet, how could I extend it to support this feature (index a document to a particular core)? Where should I start, the hashing algorithm? *Question 3:* Why are the documents also indexed into 'coreForCustomer1' and 'coreForCustomer5'? The default replica for documents is 1, right? Then I try to index some documents to 'coreForCustomer2': $ java -Durl=http://localhost:8983/solr/coreForCustomer2/update -jar post.jar ipod_video.xml While 'coreForCustomer2' still has 0 documents, and the documents in ipod_video are indexed to the cores for customers 1/3/5. *Question 4:* Why does this happen? 2) Search: I use http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml to search against 'CoreForCustomer2', while it will return all documents in the whole collection even though this core has no documents at all. Then I use http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml&shards=localhost:8983/solr/coreForCustomer2, and it will return 0 documents. *Question 5:* So if I want to search against a particular core, I need to use the 'shards' parameter with the solr core name as the parameter value, right? Thanks very much in advance! Regards, Yandong
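On the hashing question above: distributed indexing routes each document to a shard by hashing its unique key, which is why the same id always lands on the same shard and why hand-picking a core fights the design. A simplified illustration of the idea in Python -- not Solr's actual hash function, just the general technique:

```python
import hashlib

def shard_for(doc_id, num_shards):
    # Hash the unique key so a given id always routes to the same
    # shard -- later updates and deletes then find the original copy.
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Deterministic: re-indexing "mem-1" hits the same shard every time.
assert shard_for("mem-1", 2) == shard_for("mem-1", 2)
print({d: shard_for(d, 2) for d in ["mem-1", "mem-2", "mem-3"]})
```

This also explains Question 1: the client posts to one core's URL, but the cloud layer forwards each document to whichever shard its id hashes to, regardless of which core received the request.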
Re: Distributed search between solrclouds?
The thought here is to distribute a search between two different solrcloud clusters and get ordered, ranked results across them. Is it possible? On Tue, 2012-05-15 at 18:47 -0400, Darren Govoni wrote: Hi, Would distributed search (the old way where you provide the solr host IP's etc.) still work between different solrclouds? thanks, Darren
Distributed search between solrclouds?
Hi, Would distributed search (the old way where you provide the solr host IP's etc.) still work between different solrclouds? thanks, Darren
Re: Documents With large number of fields
Was there a response to this? On Fri, 2012-05-04 at 10:27 -0400, Keswani, Nitin - BLS CTR wrote: Hi, My data model consists of different types of data, and each data type has its own characteristics. If I include the unique characteristics of each type of data, a single Solr document could end up containing 300-400 fields. In order to drill down into this data set I would have to provide faceting on most of these fields so that I can drill down to a very small set of documents. Here are some of the questions: 1) What's the best approach when dealing with documents with a large number of fields? Should I keep a single document with a large number of fields, or split my document into a number of smaller documents where each document would consist of some of the fields? 2) From an operational point of view, what's the drawback of having a single document with a very large number of fields? Can Solr support documents with a large number of fields (say 300 to 400)? Thanks. Regards, Nitin Keswani
Re: Documents With large number of fields
I'm also interested in this. Same situation. On Fri, 2012-05-04 at 10:27 -0400, Keswani, Nitin - BLS CTR wrote: Hi, My data model consists of different types of data, and each data type has its own characteristics. If I include the unique characteristics of each type of data, a single Solr document could end up containing 300-400 fields. In order to drill down into this data set I would have to provide faceting on most of these fields so that I can drill down to a very small set of documents. Here are some of the questions: 1) What's the best approach when dealing with documents with a large number of fields? Should I keep a single document with a large number of fields, or split my document into a number of smaller documents where each document would consist of some of the fields? 2) From an operational point of view, what's the drawback of having a single document with a very large number of fields? Can Solr support documents with a large number of fields (say 300 to 400)? Thanks. Regards, Nitin Keswani
SolrCloud indexing question
Hi, I just wanted to make sure I understand how distributed indexing works in solrcloud. Can I index locally at each shard to avoid throttling a central port? Or all the indexing has to go through a single shard leader? thanks
Re: SolrCloud indexing question
Gotcha. Now, if I have 5 threads all writing to a local shard, will that shard piggyback those index requests onto a SINGLE connection to the leader? Or will they spawn 5 connections from the shard to the leader? I really hope the former; the latter won't scale well. On Fri, 2012-04-20 at 10:28 -0400, Jamie Johnson wrote: my understanding is that you can send your updates/deletes to any shard and they will be forwarded to the leader automatically. That being said, your leader will always be the place where the indexing happens, and it is then distributed to the other replicas. On Fri, Apr 20, 2012 at 7:54 AM, Darren Govoni dar...@ontrenet.com wrote: Hi, I just wanted to make sure I understand how distributed indexing works in solrcloud. Can I index locally at each shard to avoid throttling a central port? Or does all the indexing have to go through a single shard leader? thanks
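Whatever the forwarding layer does with connections, the standard client-side way to keep the connection count to the leader low is to batch many small updates into a few large requests rather than one request per document. A hedged sketch of the batching step in plain Python (not SolrJ; the posting itself is omitted):

```python
def batch(docs, size=100):
    # Group many small updates into a few large requests, so the
    # sender opens far fewer connections to the leader.
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

updates = [{"id": str(n)} for n in range(250)]
print([len(b) for b in batch(updates)])  # [100, 100, 50]
```

Each yielded group would then be posted as a single multi-document update request, so 250 documents cost 3 requests instead of 250.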
Re: Opposite to MoreLikeThis?
You could run the MLT query for the document in question, then gather all the doc ids in the MLT results and negate them in a subsequent query. Not sure how robust that would be with very large result sets, but it's something to try. Another approach would be to gather the interesting terms from the document in question and then negate those terms in subsequent queries. Perhaps with many negated terms, Solr will rank results with the most negated terms above those with fewer, simulating a ranked 'less like' effect. On Fri, 2012-04-20 at 15:38 -0700, Charlie Maroto wrote: Hi all, Is there a way to implement the opposite of MoreLikeThis (LessLikeThis, I guess :)? The requirement we have is to remove all documents with content like that of a given document id or a text provided by the end-user. In the current index implementation (not using Solr), the user can narrow results by indicating which document(s) are not relevant to him and then request the removal from the search results of any document whose content is like that of the selected document(s). Our index has close to 100 million documents and they cover multiple topics that are not related to one another. So, a search for some broad terms may retrieve documents about engineering, agriculture, communications, etc. As the user is trying to discover the relevant documents, he may select an agriculture-related document to exclude it, and those documents like it, from the results set; same with engineering-like content, etc., until most of the documents are about communications. Of course, some exclusions may actually remove relevant content, but those filters can be removed to go back to the previous set of results. Any ideas from similar implementations or suggestions are welcomed! Thanks, Carlos
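The first suggestion above -- negate the ids returned by an MLT run -- is just query assembly on the client. A sketch in Python, assuming the unique key field is named id and standard Lucene query syntax (both assumptions, adjust for the actual schema):

```python
def less_like_this(base_query, mlt_ids):
    # Append a NOT clause excluding the ids an MLT run returned,
    # approximating a "less like this" search.
    if not mlt_ids:
        return base_query
    clause = " ".join("id:%s" % doc_id for doc_id in mlt_ids)
    return "%s -(%s)" % (base_query, clause)

print(less_like_this("engineering", ["12", "99"]))
# engineering -(id:12 id:99)
```

As noted above, with very large MLT result sets the exclusion list (and thus the query) can grow unwieldy, so capping the number of excluded ids per round is probably wise.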
Re: hierarchical faceting?
Put the parent term in all the child documents at index time and then re-issue the facet query when you expand the parent, using the parent's term. Works perfectly. On Wed, 2012-04-18 at 10:56 -0400, sam ” wrote: I have hierarchical colors: <field name="colors" type="text_path" indexed="true" stored="true" multiValued="true"/> text_path is a TextField with PathHierarchyTokenizerFactory as the tokenizer. Given these two documents, Doc1: red Doc2: red/pink I want the result to be the following: ?fq=red == Doc1, Doc2 ?fq=red/pink == Doc2 But with PathHierarchyTokenizer, Doc1 is included for the query: ?fq=red/pink == Doc1, Doc2 How can I query for hierarchical facets? http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix, but it looks too cumbersome to me. Is there a simpler way to implement hierarchical facets?
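The index-time trick described above -- store every ancestor of a path in the child document, so faceting on the parent term matches all children -- can be sketched as a small expansion step (the taxonomy management the reply mentions happening outside Solr):

```python
def ancestor_paths(path, sep="/"):
    # Expand "red/pink" into ["red", "red/pink"] so a document indexed
    # with all its ancestors matches a facet query on any parent term.
    parts = path.split(sep)
    return [sep.join(parts[:i + 1]) for i in range(len(parts))]

print(ancestor_paths("red/pink"))       # ['red', 'red/pink']
print(ancestor_paths("red/pink/rose"))  # ['red', 'red/pink', 'red/pink/rose']
```

Index the returned list into a multivalued string field: Doc2 then carries both "red" and "red/pink", so fq on "red" matches Doc1 and Doc2 while fq on "red/pink" matches only Doc2 -- exactly the behavior the question asks for, with no special tokenizer.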
Re: hierarchical faceting?
I don't use any of that stuff in my app, so I'm not sure how it works. I just manage my taxonomy outside of solr at index time and don't need any special fields or tokenizers. I use a string field type and insert the proper field at index time and query it normally. Nothing special required. On Wed, 2012-04-18 at 13:00 -0400, sam ” wrote: It looks like TextField is the problem. This fixed it: <fieldType name="text_path" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> </analyzer> </fieldType> I am assuming the text_path fields won't include whitespace characters. ?q=colors:red/pink == Doc2 (Doc1, which has colors = red, isn't included!) Is there a tokenizer that tokenizes the string as one token? I tried to extend Tokenizer myself but it fails: public class AsIsTokenizer extends Tokenizer { @Override public boolean incrementToken() throws IOException { return true; //or false; } } On Wed, Apr 18, 2012 at 11:33 AM, sam ” skyn...@gmail.com wrote: Yah, that's exactly what PathHierarchyTokenizer does. <fieldType name="text_path" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.PathHierarchyTokenizerFactory"/> </analyzer> </fieldType> I think I have a query time tokenizer that tokenizes at / ?q=colors:red == Doc1, Doc2 ?q=colors:redfoobar == ?q=colors:red/foobarasdfoaijao == Doc1, Doc2 On Wed, Apr 18, 2012 at 11:10 AM, Darren Govoni dar...@ontrenet.com wrote: Put the parent term in all the child documents at index time and then re-issue the facet query when you expand the parent, using the parent's term. Works perfectly. On Wed, 2012-04-18 at 10:56 -0400, sam ” wrote: I have hierarchical colors: <field name="colors" type="text_path" indexed="true" stored="true" multiValued="true"/> text_path is a TextField with PathHierarchyTokenizerFactory as the tokenizer.
Given these two documents, Doc1: red Doc2: red/pink I want the result to be the following: ?fq=red == Doc1, Doc2 ?fq=red/pink == Doc2 But, with PathHierarchyTokenizer, Doc1 is included for the query: ?fq=red/pink == Doc1, Doc2 How can I query for hierarchical facets? http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix.. But it looks too cumbersome to me. Is there a simpler way to implement hierarchical facets?
Re: Monitoring SolrCloud health
Can you be more specific about health? On Sat, 2012-04-14 at 00:03 -0400, Jamie Johnson wrote: How do people currently monitor the health of a solr cluster? Are there any good tools which can show the health across the entire cluster? Is this something which is planned for the new admin user interface?
RE: Realtime /get versus SearchHandler
Yes --- Original Message --- On 4/13/2012 06:25 AM Benson Margulies wrote: A discussion over on the dev list led me to expect that the by-id field retrievals in a SolrCloud query would come through the get handler. In fact, I've seen them turn up in my search component in the search handler that is configured with my custom QT. (I have a 'prepare' method that sets ShardParams.QT to my QT to get my processing involved in the first of the two queries.) Did I overthink this?
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
You could use SolrCloud (for the automatic scaling) and just mount a fuse[1] HDFS directory and configure solr to use that directory for its data. [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote: Hi, I'm trying to setup a large scale *Crawl + Index + Search *infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, crawled + indexed every *4 weeks, *with a search latency of less than 0.5 seconds. Needless to mention, the search index needs to scale to 5Billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index. However, I would like for such a system to be homogenous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment was flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server re-starts). However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is mature enough and would be the right architectural choice to go along with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects above. 
Lastly, how much hardware (assuming a medium sized EC2 instance) would you estimate my needing with this setup, for regular web-data (HTML text) at this scale? Any architectural guidance would be greatly appreciated. The more details provided, the wider my grin :). Many many thanks in advance. Thanks, Safdar
RE: Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
Solrcloud or any other tech-specific replication isn't going to 'just work' with hadoop replication. But with some significant custom coding anything should be possible. Interesting idea. --- Original Message --- On 4/12/2012 09:21 AM Ali S Kureishy wrote: Thanks Darren. Actually, I would like the system to be homogenous - i.e., use Hadoop-based tools that already provide all the necessary scaling for the lucene index (in terms of throughput, latency of writes/reads etc). Since SolrCloud adds its own layer of sharding/replication that is outside Hadoop, I feel that using SolrCloud would be redundant, and a step in the opposite direction, which is what I'm trying to avoid in the first place. Or am I mistaken? Thanks, Safdar On Thu, Apr 12, 2012 at 4:27 PM, Darren Govoni dar...@ontrenet.com wrote: You could use SolrCloud (for the automatic scaling) and just mount a fuse[1] HDFS directory and configure solr to use that directory for its data. [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote: Hi, I'm trying to set up a large scale *Crawl + Index + Search* infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 billion web pages*, crawled + indexed every *4 weeks*, with a search latency of less than 0.5 seconds. Needless to mention, the search index needs to scale to 5 billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index. However, I would like for such a system to be homogenous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment was flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server re-starts). However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is mature enough and would be the right architectural choice to go along with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects above. Lastly, how much hardware (assuming a medium sized EC2 instance) would you estimate my needing with this setup, for regular web-data (HTML text) at this scale? Any architectural guidance would be greatly appreciated. The more details provided, the wider my grin :). Many many thanks in advance. Thanks, Safdar
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
Hard to say why it's not working for you. Start with a fresh Solr and work forward from there, or back out your configs and plugins until it works again. On Tue, 2012-04-10 at 17:15 -0400, Benson Margulies wrote: In my cloud configuration, if I push <delete><query>*:*</query></delete> followed by <commit/>, I get no errors and the log looks happy enough, but the documents remain in the index, visible to /query. Here's what seems the relevant bit of my solrconfig.xml. My URP only implements processAdd. <updateRequestProcessorChain name="RNI"> <!-- some day, add parameters when we have some --> <processor class="com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory"/> <processor class="solr.LogUpdateProcessorFactory"/> <processor class="solr.DistributedUpdateProcessorFactory"/> <processor class="solr.RunUpdateProcessorFactory"/> </updateRequestProcessorChain> <!-- activate RNI processing by adding the RNI URP to the chain for xml updates --> <requestHandler name="/update" class="solr.XmlUpdateRequestHandler"> <lst name="defaults"> <str name="update.chain">RNI</str> </lst> </requestHandler>
RE: SOLR issue - too many search queries
My first reaction to your question is: why are you running thousands of queries in a loop? Immediately, I think this will not scale well and the design probably needs to be re-visited. Second, if you need that many requests, then you need to seriously consider an architecture that supports it. This will require a complex design involving load balancers, multiple servers, replication, etc. People have achieved this with Solr, but it's beyond the scope of Solr itself to provide this, as it's a matter of system architecture. Also, there are limits to the number of app server threads allowed, OS threads allowed, OS sockets, OS file descriptors, etc. All of these need to be understood, designed for, and configured properly. --- Original Message --- On 4/10/2012 07:51 AM arunssasidhar wrote: We have a PHP web application which is using SOLR for searching. The app is using CURL to connect to the SOLR server, and it runs in a loop with thousands of predefined keywords. That will create thousands of different search queries to SOLR at a given time. My issue is that when a single user is logged into the app, everything works as expected. When more than one user tries to run the app, we get this response from the server: Failed to connect to xxx.xxx.xxx.xxx: Cannot assign requested address Failed to connect to xxx.xxx.xxx.xxx: Cannot assign requested address Failed ... Our assumption is that the SOLR server is unable to handle this many search queries at a given time. If so, what is the solution to overcome this? Is there any setting like keep-alive in SOLR? Any help would be highly appreciated. Thanks, Arun S -- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-issue-too-many-search-queries-tp3899518p3899518.html Sent from the Solr - User mailing list archive at Nabble.com.
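Beyond the architectural points above, one immediate client-side mitigation for the quoted problem is to stop issuing one HTTP request per keyword: OR-ing keywords into combined queries collapses thousands of connections into a handful. A hedged sketch of the query-building step in Python (batch size and quoting are assumptions; the original app is PHP/CURL):

```python
def keyword_batches(keywords, per_query=20):
    # Combine many single-keyword searches into far fewer boolean OR
    # queries, cutting the number of simultaneous HTTP connections.
    for i in range(0, len(keywords), per_query):
        group = keywords[i:i + per_query]
        yield " OR ".join('"%s"' % kw for kw in group)

print(list(keyword_batches(["solr", "lucene", "search"], per_query=2)))
# ['"solr" OR "lucene"', '"search"']
```

The "Cannot assign requested address" error typically means the client has exhausted local ephemeral ports by opening a new connection per query, so fewer, reused connections attack the symptom directly as well.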
RE: Re: Cloud-aware request processing?
...it is a distributed real-time query scheme... SolrCloud does this already. It treats all the shards like one big index, and you can query it normally to get subset results from each shard. Why do you have to re-write the query for each shard? Seems unnecessary. --- Original Message --- On 4/9/2012 08:45 AM Benson Margulies wrote: Jan Høydahl, My problem is intimately connected to Solr. It is not a batch job for Hadoop, it is a distributed real-time query scheme. I hate to add yet another complex framework if a Solr RP can do the job simply. For this problem, I can transform a Solr query into a subset query on each shard, and then let the SolrCloud mechanism. I am well aware of the 'zoo' of alternatives, and I will be evaluating them if I can't get what I want from Solr. On Mon, Apr 9, 2012 at 9:34 AM, Jan Høydahl jan@cominvent.com wrote: Hi, Instead of using Solr, you may want to have a look at Hadoop or another framework for distributed computation, see e.g. http://java.dzone.com/articles/comparison-gridcloud-computing -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 9. apr. 2012, at 13:41, Benson Margulies wrote: I'm working on a prototype of a scheme that uses SolrCloud to, in effect, distribute a computation by running it inside of a request processor. If there are N shards and M operations, I want each node to perform M/N operations. That, of course, implies that I know N. Is that fact available anyplace inside Solr, or do I need to just configure it?
Re: How to facet data from a multivalued field?
The field type for that field should be looked at. Try not using a type that tokenizes or stems the field; you want to leave the text as is. I forget the exact setting, but it's documented in there somewhere. On Mon, 2012-04-09 at 13:02 -0700, Thiago wrote: Hello everybody, I've already searched this topic in the forum, but I didn't find any case like this. I apologize if this topic has already been discussed. I'm having a problem faceting a multivalued field. My field is called series, and it has names of TV series like The Big Bang Theory, Two and a Half Men... In this field I can have a lot of TV series names. For example: <arr name="series"> <str>Two and a Half Men</str> <str>How I Met Your Mother</str> <str>The Big Bang Theory</str> </arr> What I want to do is: search and count how many documents relate to each series. I'm doing it using facet search on this field. But it's returning each word separately, like this: <lst name="facet_counts"> <lst name="facet_queries"/> <lst name="facet_fields"> <lst name="series"> <int name="bang">91</int> <int name="big">91</int> <int name="half">21</int> <int name="how">45</int> <int name="i">45</int> <int name="men">21</int> <int name="met">45</int> <int name="mother">45</int> <int name="theori">91</int> <int name="two">21</int> <int name="your">45</int> </lst> </lst> <lst name="facet_dates"/> <lst name="facet_ranges"/> </lst> And what I want is something like: <lst name="facet_counts"> <lst name="facet_queries"/> <lst name="facet_fields"> <lst name="series"> <int name="Two and a Half Men">21</int> <int name="How I Met Your Mother">45</int> <int name="The Big Bang Theory">91</int> </lst> </lst> <lst name="facet_dates"/> <lst name="facet_ranges"/> </lst> Is there any possible way to do it with facet search? I don't want the terms, I just want each string including the white spaces. Do I have to change my fieldtype to do this? Thanks to everybody. Thiago -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-facet-data-from-a-multivalued-field-tp3897853p3897853.html Sent from the Solr - User mailing list archive at Nabble.com.
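For reference, the standard fix here is to facet on an untokenized copy of the field. A sketch of the schema.xml change (field names other than `series` are assumptions, not from the thread):

```xml
<!-- Hypothetical schema.xml fragment: keep the analyzed field for full-text
     search, and facet on an untokenized string copy instead. -->
<field name="series"       type="text"   indexed="true" stored="true"  multiValued="true"/>
<field name="series_facet" type="string" indexed="true" stored="false" multiValued="true"/>
<copyField source="series" dest="series_facet"/>
```

Faceting with facet.field=series_facet then returns whole values such as "The Big Bang Theory" instead of individual stemmed tokens like "theori".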
No webadmin for trunk?
Hi, Just updated Solr trunk, tried java -jar start.jar, and localhost:8983/solr/admin ... not found. Where did it go? thanks.
Re: No webadmin for trunk?
HTTP ERROR: 404 Problem accessing /solr. Reason: Not Found Powered by Jetty:// On Sat, 2012-04-07 at 09:04 -0400, Jamie Johnson wrote: just go to localhost:8983/solr and you'll see the updated interface. On Sat, Apr 7, 2012 at 8:23 AM, Darren Govoni dar...@ontrenet.com wrote: Hi, Just updated solr trunk and tried the java -jar start.jar and localhost:8983/solr/admin.not found. Where did it go? thanks.
Re: No webadmin for trunk?
start.jar has no apps in it at all. On Sat, 2012-04-07 at 09:47 -0400, Darren Govoni wrote: HTTP ERROR: 404 Problem accessing /solr. Reason: Not Found Powered by Jetty:// On Sat, 2012-04-07 at 09:04 -0400, Jamie Johnson wrote: just go to localhost:8983/solr and you'll see the updated interface. On Sat, Apr 7, 2012 at 8:23 AM, Darren Govoni dar...@ontrenet.com wrote: Hi, Just updated solr trunk and tried the java -jar start.jar and localhost:8983/solr/admin.not found. Where did it go? thanks.
Re: No webadmin for trunk?
Yep. I did all kinds of ant clean, ant dist, ant example, etc. My trunk rev: At revision 1310773. The example start.jar is broken. No webapp inside. :( On Sat, 2012-04-07 at 16:11 +0200, Rafał Kuć wrote: Hello! Did you run 'ant example'?
Re: No webadmin for trunk?
K. There is a solr.war in the webapps directory. But I still get the 404. On Sat, 2012-04-07 at 16:19 +0200, Rafał Kuć wrote: Hello! start.jar shouldn't contain any webapp. If you look at the 'example' directory, you'll notice that there is a 'webapps' directory which should contain the solr.war file. Btw. revision 1307647 works without a problem. I'll checkout trunk in a few and try with the newest revision.
Re: No webadmin for trunk?
Now it comes up. Not sure why it's acting weird. Will continue to look at it. On Sat, 2012-04-07 at 10:23 -0400, Darren Govoni wrote: K. There is a solr.war in the webapps directory. But still get the 404. On Sat, 2012-04-07 at 16:19 +0200, Rafał Kuć wrote: Hello! start.jar shouldn't contain any webapp. If you look at the 'example' directory, you'll notice that there is a 'webapps' directory which should contain the solr.war file. Btw. revision 1307647 works without a problem. I'll checkout trunk in a few and try with the newest revision.
Re: upgrade 3.5 to 4.0
In my opinion, it's never a good idea to overwrite files of a previous version with a new version. The easiest thing would be to just deploy the Solr war file into Tomcat and let Tomcat manage the webapp, files, etc. On Sat, 2012-04-07 at 22:39 -0400, Dan Foley wrote: I have downloaded the nightly snapshot of v4.0 and would like to install it to my Tomcat install of Solr 3.5. Can I simply overwrite the current files, or is there a correct method for doing so? please advise.. thanks
Re: Does any one know when Solr 4.0 will be released.
No one knows. But if you ask the devs, they will say 'when it's done'. One clue might be to monitor the bugs/issues scheduled for 4.0; when they are all resolved, then it's ready. On Wed, 2012-04-04 at 09:41 -0700, srinivas konchada wrote: Hello everyone, Does anyone know when Solr 4.0 will be released? There is a specific feature that exists in 4.0 which we want to take advantage of. The problem is we cannot deploy something into production from trunk. We need to use an official release. Thanks Srinivas Konchada
Re: Duplicates in Facets
Try using Luke to look at your index and see if there are multiple similar TFVs. You can browse them easily in Luke. On Wed, 2012-04-04 at 23:35 -0400, Jamie Johnson wrote: I am currently indexing some information and am wondering why I am getting duplicates in facets. From what I can tell they are the same, but is there any case that could cause this that I may not be thinking of? Could this be some non-printable character making its way into the index? Sample output from Luke: <lst name="fields"> <lst name="organization_umvs"> <str name="type">string</str> <str name="schema">I--M---OFl</str> <str name="dynamicBase">*_umvs</str> <str name="index">(unstored field)</str> <int name="docs">332</int> <int name="distinct">-1</int> <lst name="topTerms"> <int name="ORGANIZATION 1">328</int> <int name="ORGANIZATION 2">124</int> <int name="ORGANIZATION 2">36</int> <int name="ORGANIZATION 2">20</int> <int name="ORGANIZATION 3">4</int> </lst>
Custom scoring question
Hi, I have a situation where I want to re-score document relevance. Let's say I have two fields: text: The quick brown fox jumped over the white fence. terms: fox fence Now my queries come in as: terms:[* TO *] and Solr scores them on that field. What I want is to rank them according to the distribution of field terms within field text, which is a per-document calculation. Can this be done with any kind of dismax? I'm not searching for known terms at query time. If not, what is the best way to implement a custom scoring handler to perform this calculation and re-score/sort the results? thanks for any tips!!!
Re: Custom scoring question
I'm going to try index time per-field boosting and do the boost computation at index time and see if that helps. On Thu, 2012-03-29 at 10:08 -0400, Darren Govoni wrote: Hi, I have a situation I want to re-score document relevance. Let's say I have two fields: text: The quick brown fox jumped over the white fence. terms: fox fence Now my queries come in as: terms:[* TO *] and Solr scores them on that field. What I want is to rank them according to the distribution of field terms within field text. Which is a per document calculation. Can this be done with any kind of dismax? I'm not searching for known terms at query time. If not, what is the best way to implement a custom scoring handler to perform this calculation and re-score/sort the results? thanks for any tips!!!
Re: Custom scoring question
Yeah, I guess that would work. I wasn't sure if it would change relative to other documents. But if it were to be combined with other fields, that approach may not work, because the calculation wouldn't include the scoring for other parts of the query. So then you have the dynamic score and what to do with it. On Thu, 2012-03-29 at 16:29 -0300, Tomás Fernández Löbbe wrote: Can't you simply calculate that at index time and assign the result to a field, then sort by that field? On Thu, Mar 29, 2012 at 12:07 PM, Darren Govoni dar...@ontrenet.com wrote: I'm going to try index time per-field boosting and do the boost computation at index time and see if that helps. On Thu, 2012-03-29 at 10:08 -0400, Darren Govoni wrote: Hi, I have a situation I want to re-score document relevance. Let's say I have two fields: text: The quick brown fox jumped over the white fence. terms: fox fence Now my queries come in as: terms:[* TO *] and Solr scores them on that field. What I want is to rank them according to the distribution of field terms within field text. Which is a per document calculation. Can this be done with any kind of dismax? I'm not searching for known terms at query time. If not, what is the best way to implement a custom scoring handler to perform this calculation and re-score/sort the results? thanks for any tips!!!
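The per-document calculation discussed in this thread can be sketched as a small index-time helper (a hypothetical illustration, not Solr API; the idea, per the index-time suggestion above, is to compute the value once per document and store it in a sortable field):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical index-time helper: score a document by how densely the
// tokens of its "terms" field occur within its "text" field.
public class TermDensity {
    public static double score(String text, String terms) {
        String[] textToks = text.toLowerCase().split("\\W+");
        Set<String> termSet =
            new HashSet<String>(Arrays.asList(terms.toLowerCase().split("\\W+")));
        int hits = 0;
        for (String tok : textToks) {
            if (!tok.isEmpty() && termSet.contains(tok)) hits++;
        }
        // Fraction of text tokens that are query terms; store this in a
        // numeric field and sort or boost on it at query time.
        return textToks.length == 0 ? 0.0 : (double) hits / textToks.length;
    }
}
```

For the thread's example ("fox fence" against the fence sentence) this yields 2 matching tokens out of 9. The value would be written into, say, a `terms_density` field (name assumed) at index time and used with sort=terms_density desc, sidestepping the problem of the dynamic score interacting with other query clauses.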
MLT and solrcloud?
Hi, It was mentioned before that SolrCloud has all the capability of regular Solr (including handlers) with the exception of the MLT handler. As this is a key capability for Solr, is there work planned to include MLT in SolrCloud? If so, when? Our efforts greatly depend on it. As such, I'm happy to help any way possible. thanks, Darren
Re: MLT and solrcloud?
Ok, I'll do what I can to help! As always, appreciate the hard work, Mark. On Thu, 2012-03-22 at 17:31 -0400, Mark Miller wrote: On Mar 22, 2012, at 5:22 PM, Darren Govoni wrote: Hi, It was mentioned before that SolrCloud has all the capability of regular Solr (including handlers) with the exception of the MLT handler. As this is a key capability for Solr, is there work planned to include MLT in SolrCloud? If so, when? Our efforts greatly depend on it. As such, I'm happy to help any way possible. thanks, Darren Usually no real timetables here :) Depends on who jumps in when. Some work has already gone on for this here: https://issues.apache.org/jira/browse/SOLR-788 You might just try and jump-start that issue again? As I get a free moment or two, I'm happy to help commit a solution. - Mark Miller lucidimagination.com
RE: Re: maxClauseCount Exception
True, but how can you find documents containing that field without expanding 1000 clauses? --- Original Message --- On 3/19/2012 07:24 AM Erick Erickson wrote: bq: So all I want to do is a simple all docs with something in this field, and to highlight the field But that doesn't really make sense to do at the Solr/Lucene level. All you're saying is that you want that field highlighted. Wouldn't it be much easier to just do this at the app level whenever your field had anything returned in it? Best Erick On Sat, Mar 17, 2012 at 8:07 PM, Darren Govoni dar...@ontrenet.com wrote: Thanks for the tip Hoss. I notice that it appears sometimes and was varying because my index runs would sometimes have different amounts of docs, etc. So all I want to do is a simple all docs with something in this field, and to highlight the field. Is the query expansion to all possible terms in the index really necessary? I could have 100's of thousands of possible terms. Why should they all become explicit query elements? Seems overkill and underperformant. Is there another way with Lucene, or not really? On Thu, 2012-03-08 at 16:18 -0800, Chris Hostetter wrote: : I am suddenly getting a maxClauseCount exception for no reason. I am : using Solr 3.5. I have only 206 documents in my index. Unless things have changed, the reason you are seeing this is because _highlighting_ a query (clause) like type_s:[*+TO+*] requires rewriting it into a giant boolean query of all the terms in that field -- so even if you only have 206 docs, if you have more than 206 values in that field in your index, you're going to go over 1024 terms.
(you don't get this problem in a basic query, because it doesn't need to enumerate all the terms, it rewrites it to a ConstantScoreQuery) what you most likely want to do, is move some of those clauses like type_s:[*+TO+*] and usergroup_sm:admin out of your main q query and into fq filters ... so they can be cached independently, won't contribute to scoring (just matching) and won't be used in highlighting. : params={hl=true&hl.snippets=4&hl.simple.pre=<b>&hl.simple.post=</b>&fl=*,score&hl.mergeContiguous=true&hl.usePhraseHighlighter=true&hl.requireFieldMatch=true&echoParams=all&hl.fl=text_t&q={!lucene+q.op%3DOR+df%3Dtext_t}+(+kind_s:doc+OR+kind_s:xml)+AND+(type_s:[*+TO+*])+AND+(usergroup_sm:admin)&rows=20&start=0&wt=javabin&version=2} hits=204 status=500 QTime=166 |#] : [#|2012-02-22T13:40:13.131-0500|SEVERE|glassfish3.1.1| : org.apache.solr.servlet.SolrDispatchFilter| : _ThreadID=22;_ThreadName=Thread-2;|org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024 : at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136) : ... : at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:304) : at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158) -Hoss
Re: Inconsistent Results with ZooKeeper Ensemble and Four SOLR Cloud Nodes
I think he's asking if all the nodes (same machine or not) return a response. Presumably you have different ports for each node since they are on the same machine. On Sun, 2012-03-18 at 14:44 -0400, Matthew Parker wrote: The cluster is running on one machine. On Sun, Mar 18, 2012 at 2:07 PM, Mark Miller markrmil...@gmail.com wrote: From every node in your cluster you can hit http://MACHINE1:8084/solr in your browser and get a response? On Mar 18, 2012, at 1:46 PM, Matthew Parker wrote: My cloud instance finally tried to sync. It looks like it's having connection issues, but I can bring the SOLR instance up in the browser so I'm not sure why it cannot connect to it. I got the following condensed log output: org.apache.commons.httpclient.HttpMethodDirector executeWithRetry I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect org.apache.commons.httpclient.HttpMethodDirector executeWithRetry I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect org.apache.commons.httpclient.HttpMethodDirector executeWithRetry I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect Retrying request shard update error StdNode: http://MACHINE1:8084/solr/:org.apache.solr.client.solrj.SolrServerException: http://MACHINE1:8084/solr at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java: 483) .. .. .. Caused by: java.net.ConnectException: Connection refused: connect at java.net.DualStackPlainSocketImpl.connect0(Native Method) .. .. .. try and ask http://MACHINE1:8084/solr to recover Could not tell a replica to recover org.apache.solr.client.solrj.SolrServerException: http://MACHINE1:8084/solr at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:483) ... ... ... 
Caused by: java.net.ConnectException: Connection refused: connect at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method) .. .. .. On Sat, Mar 17, 2012 at 10:10 PM, Mark Miller markrmil...@gmail.com wrote: Nodes talk to ZooKeeper as well as to each other. You can see the addresses they are trying to use to communicate with each other in the 'cloud' view of the Solr Admin UI. Sometimes you have to override these, as the detected default may not be an address that other nodes can reach. As a limited example: for some reason my mac cannot talk to my linux box with its default detected host address of halfmetal:8983/solr - but the mac can reach my linux box if I use halfmetal.Local - so I have to override the published address of my linux box using the host attribute if I want to set up a cluster between my macbook and linux box. Each node talks to ZooKeeper to learn about the other nodes, including their addresses. Recovery is then done node to node using the appropriate addresses. - Mark Miller lucidimagination.com On Mar 16, 2012, at 3:00 PM, Matthew Parker wrote: I'm still having issues replicating in my work environment. Can anyone explain how the replication mechanism works? Is it communicating across ports or through zookeeper to manage the process? On Thu, Mar 8, 2012 at 10:57 PM, Matthew Parker mpar...@apogeeintegration.com wrote: All, I recreated the cluster on my machine at home (Windows 7, Java 1.6.0.23, apache-solr-4.0-2012-02-29_09-07-30), sent some documents through Manifold using its crawler, and it looks like it's replicating fine once the documents are committed. This must be related to my environment somehow. Thanks for your help.
Regards, Matt On Fri, Mar 2, 2012 at 9:06 AM, Erick Erickson erickerick...@gmail.com wrote: Matt: Just for paranoia's sake, when I was playing around with this (the _version_ thing was one of my problems too) I removed the entire data directory as well as the zoo_data directory between experiments (and recreated just the data dir). This included various index.2012 files and the tlog directory, on the theory that *maybe* there was some confusion happening on startup with an already-wonky index. If you have the energy and tried that, it might be helpful information, but it may also be a total red herring. FWIW, Erick On Thu, Mar 1, 2012 at 8:28 PM, Mark Miller markrmil...@gmail.com wrote: I'm assuming the Windows configuration looked correct? Yeah, so far I cannot spot any smoking gun... I'm confounded at the moment. I'll re-read through everything once more... - Mark
Re: maxClauseCount Exception
Thanks for the tip Hoss. I notice that it appears sometimes and was varying because my index runs would sometimes have different amounts of docs, etc. So all I want to do is a simple all docs with something in this field, and to highlight the field. Is the query expansion to all possible terms in the index really necessary? I could have 100's of thousands of possible terms. Why should they all become explicit query elements? Seems overkill and underperformant. Is there another way with Lucene, or not really? On Thu, 2012-03-08 at 16:18 -0800, Chris Hostetter wrote: : I am suddenly getting a maxClauseCount exception for no reason. I am : using Solr 3.5. I have only 206 documents in my index. Unless things have changed, the reason you are seeing this is because _highlighting_ a query (clause) like type_s:[*+TO+*] requires rewriting it into a giant boolean query of all the terms in that field -- so even if you only have 206 docs, if you have more than 206 values in that field in your index, you're going to go over 1024 terms. (you don't get this problem in a basic query, because it doesn't need to enumerate all the terms, it rewrites it to a ConstantScoreQuery) what you most likely want to do, is move some of those clauses like type_s:[*+TO+*] and usergroup_sm:admin out of your main q query and into fq filters ... so they can be cached independently, won't contribute to scoring (just matching) and won't be used in highlighting.
: params={hl=true&hl.snippets=4&hl.simple.pre=<b>&hl.simple.post=</b>&fl=*,score&hl.mergeContiguous=true&hl.usePhraseHighlighter=true&hl.requireFieldMatch=true&echoParams=all&hl.fl=text_t&q={!lucene+q.op%3DOR+df%3Dtext_t}+(+kind_s:doc+OR+kind_s:xml)+AND+(type_s:[*+TO+*])+AND+(usergroup_sm:admin)&rows=20&start=0&wt=javabin&version=2} hits=204 status=500 QTime=166 |#] : [#|2012-02-22T13:40:13.131-0500|SEVERE|glassfish3.1.1| : org.apache.solr.servlet.SolrDispatchFilter| : _ThreadID=22;_ThreadName=Thread-2;|org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024 : at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136) : ... : at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:304) : at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158) -Hoss
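Hoss's q-vs-fq suggestion, sketched concretely as request parameters (reconstructed from the logged query in this thread; exact values are assumptions): keep only the scored clauses in q and move the match-only clauses into cached fq filters, which are never expanded for highlighting.

```
# before: everything in q (range clause gets expanded for highlighting)
q={!lucene q.op=OR df=text_t}(kind_s:doc OR kind_s:xml) AND (type_s:[* TO *]) AND (usergroup_sm:admin)

# after: match-only clauses as filter queries
q={!lucene q.op=OR df=text_t}(kind_s:doc OR kind_s:xml)
fq=type_s:[* TO *]
fq=usergroup_sm:admin
```

The fq clauses are cached independently, do not contribute to scoring, and do not participate in highlighting, so the TooManyClauses rewrite never happens.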
RE: Solr 4.0 and production environments
As a rule of thumb, many will say not to go to production with a pre-release baseline. So until Solr 4 goes final and stable, it's best not to assume too much about it. Second suggestion is to properly stage new technologies in your product such that they go through their own validation. And so to that end, jump right in and start using Solr 4 and see for yourself! It's a great technology. --- Original Message --- On 3/7/2012 11:47 AM Dirceu Vieira wrote: Hi All, Has anybody started using Solr 4.0 in production environments? Is it stable enough? I'm planning to create a proof of concept using Solr 4.0; we have some projects that will gain a lot from features such as near-real-time search, joins and others, that are available only in version 4. Is it too risky to think of using it right now? What are your thoughts and experiences with that? Best regards, -- Dirceu Vieira Júnior --- +47 9753 2473 dirceuvjr.blogspot.com twitter.com/dirceuvjr
Re: Building a resilient cluster
What I think was mentioned on this a bit ago is that the index stops working if one of the nodes goes down, unless it's a replica. You have 2 nodes running with numShards=2? Thus if one goes down, the entire index is inoperable. In the future I'm hoping this changes such that the index cluster continues to operate but will lack results from the downed node. Maybe this has changed in recent trunk updates though. Not sure. On Mon, 2012-03-05 at 20:49 -0800, Ranjan Bagchi wrote: Hi Mark, So I tried this: started up one instance w/ zookeeper, and started a second instance defining a shard name in solr.xml -- it worked, searching would search both indices, and looking at the zookeeper ui, I'd see the second shard. However, when I brought the second server down -- the first one stopped working: it didn't kick the second shard out of the cluster. Any way to do this? Thanks, Ranjan From: Mark Miller markrmil...@gmail.com To: solr-user@lucene.apache.org Cc: Date: Wed, 29 Feb 2012 22:57:26 -0500 Subject: Re: Building a resilient cluster Doh! Sorry - this was broken - I need to fix the doc or add it back. The shard id is actually set in solr.xml since it's per core - the sys prop was a sugar option we had set up. So either add 'shard' to the core in solr.xml, or to make it work like it does in the doc, do: <core name="collection1" shard="${shard:}" instanceDir="." /> That sets shard to the 'shard' system property if it's set, or as a default, acts as if it wasn't set. I've been working with custom shard ids mainly through solrj, so I hadn't noticed this. - Mark On Wed, Feb 29, 2012 at 10:36 AM, Ranjan Bagchi ranjan.bag...@gmail.com wrote: Hi, At this point I'm ok with one zk instance being a point of failure, I just want to create sharded solr instances, bring them into the cluster, and be able to shut them down without bringing down the whole cluster.
According to the wiki page, I should be able to bring up a new shard by using shardId [-D shardId], but when I did that, the logs showed it replicating an existing shard. Ranjan Andre Bois-Crettez wrote: You have to run ZK on at least 3 different machines for fault tolerance (a ZK ensemble). http://wiki.apache.org/solr/SolrCloud#Example_C:_Two_shard_cluster_with_shard_replicas_and_zookeeper_ensemble Ranjan Bagchi wrote: Hi, I'm interested in setting up a solr cluster where each machine [at least initially] hosts a separate shard of a big index [too big to sit on the machine]. I'm able to put a cloud together by telling it that I have (to start out with) 4 nodes, and then starting up nodes on 3 machines pointing at the zkInstance. I'm able to load my sharded data onto each machine individually and it seems to work. My concern is that it's not fault tolerant: if one of the non-zookeeper machines falls over, the whole cluster won't work. Also, I can't create a shard with more data and have it work within the existing cloud. I tried using -DshardId=shard5 [on an existing 4-shard cluster], but it just started replicating, which doesn't seem right. Are there ways around this? Thanks, Ranjan Bagchi -- - Mark http://www.lucidimagination.com
maxClauseCount error
Hi, I am suddenly getting a maxClauseCount error and don't know why. I am using Solr 3.5.
maxClauseCount Exception
Hi, I am suddenly getting a maxClauseCount exception for no reason. I am using Solr 3.5. I have only 206 documents in my index. Any ideas? This is weird. QUERY PARAMS: [hl, hl.snippets, hl.simple.pre, hl.simple.post, fl, hl.mergeContiguous, hl.usePhraseHighlighter, hl.requireFieldMatch, echoParams, hl.fl, q, rows, start]|#] [#|2012-02-22T13:40:13.129-0500|INFO|glassfish3.1.1| org.apache.solr.core.SolrCore|_ThreadID=22;_ThreadName=Thread-2;|[] webapp=/solr3 path=/select params={hl=true&hl.snippets=4&hl.simple.pre=<b>&hl.simple.post=</b>&fl=*,score&hl.mergeContiguous=true&hl.usePhraseHighlighter=true&hl.requireFieldMatch=true&echoParams=all&hl.fl=text_t&q={!lucene+q.op%3DOR+df%3Dtext_t}+(+kind_s:doc+OR+kind_s:xml)+AND+(type_s:[*+TO+*])+AND+(usergroup_sm:admin)&rows=20&start=0&wt=javabin&version=2} hits=204 status=500 QTime=166 |#] [#|2012-02-22T13:40:13.131-0500|SEVERE|glassfish3.1.1| org.apache.solr.servlet.SolrDispatchFilter| _ThreadID=22;_ThreadName=Thread-2;|org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024 at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136) at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:127) at org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java:51) at org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java:41) at org.apache.lucene.search.ScoringRewrite$3.collect(ScoringRewrite.java:95) at org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:38) at org.apache.lucene.search.ScoringRewrite.rewrite(ScoringRewrite.java:93) at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:304) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:98) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:385) at
org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:217) at org.apache.lucene.search.highlight.QueryScorer.<init>(QueryScorer.java:185) at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:205) at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:490) at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401) at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:131) at org.apache.so
Trunk build errors
Hi, I am getting numerous errors preventing a build of SolrCloud trunk: [licenses] MISSING LICENSE for the following file: Any tips to get a clean build working? thanks
Re: SolrJ + SolrCloud
Thanks Mark. Is there any plan to make all the Solr search handlers work with SolrCloud, like MLT? That missing feature would prohibit us from using SolrCloud at the moment. :( On Sat, 2012-02-11 at 18:24 -0500, Mark Miller wrote: On Feb 11, 2012, at 6:02 PM, Darren Govoni wrote: Hi, Do all the normal facilities of Solr work with SolrCloud from SolrJ? Things like /mlt, /cluster, facets , tvf's, etc. Darren SolrJ works the same in SolrCloud mode as it does in non SolrCloud mode - it's fully supported. There is even a new SolrJ client called CloudSolrServer that has built in cluster awareness and load balancing. In terms of what is supported - anything that is supported with distributed search - that is most things, but there is the odd man out - like MLT - looks like an issue is open here: https://issues.apache.org/jira/browse/SOLR-788 but it's not resolved yet. - Mark Miller lucidimagination.com
SolrJ + SolrCloud
Hi, Do all the normal facilities of Solr work with SolrCloud from SolrJ? Things like /mlt, /cluster, facets, tvf's, etc. Darren
Re: Range facet - Count in facet menu != Count in search results
Double-check your default operator for a faceted search vs. a regular search. I caught that difference in my own work, and it explained a discrepancy like this one. On Fri, 2012-02-10 at 07:45 -0800, Yuhao wrote: Jan, Was the curly closing bracket } intentional? I'm using 3.4, which also supports fq=price:[10 TO 20]. The problem is the results are not working properly. From: Jan Høydahl jan@cominvent.com To: solr-user@lucene.apache.org; Yuhao nfsvi...@yahoo.com Sent: Thursday, February 9, 2012 7:45 PM Subject: Re: Range facet - Count in facet menu != Count in search results Hi, If you use the trunk (4.0) version, you can say fq=price:[10 TO 20} and have the upper bound be exclusive. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 10. feb. 2012, at 00:58, Yuhao wrote: I've changed the facet.range.include option to every possible value (lower, upper, edge, outer, all)**. It only changes the count shown in the Ranges facet menu on the left. It has no effect on the count and results shown in search results, which ALWAYS is inclusive of both the lower AND upper bounds (which is equivalent to include = all). Is this by design? I would like to make the search results include the lower bound, but not the upper bound. Can I do that? My range field is multi-valued, but I don't think that should be the problem. ** Actually, it doesn't like outer for some reason, which leaves the facet completely empty.
Re: SolrCloud war?
UPDATE: I set my app server's[1] system property jetty.port to be equal to the app server's open port and was able to get two Solr shards to talk. The overall properties I set are: App server domain 1: bootstrap_confdir, collection.configName, jetty.port, solr.solr.home, zkRun. App server domain 2: bootstrap_confdir, collection.configName, jetty.port, solr.solr.home, zkHost. I deployed each war app into the /solr context; I presume it's needed by remote URL addressing. I checked the zookeeper config page and it shows both shards. Awesome. [1] Glassfish 3.1.1 On 02/01/2012 08:50 PM, Mark Miller wrote: I have not yet tried to run SolrCloud in another app server, but it shouldn't be a problem. One issue you might have is the fact that we count on hostPort coming from the system property jetty.port. This is set in the default solr.xml - the hostPort defaults to jetty.port. You probably want to explicitly pass -DhostPort= if you are not going to use jetty.port. - Mark Miller lucidimagination.com On Feb 1, 2012, at 2:44 PM, Darren Govoni wrote: Hi, I'm trying to get the SolrCloud2 examples to work using a war-deployed Solr in Glassfish. The startup properties must be different in this case, because it's having trouble connecting to zookeeper when I deploy the solr war file. Perhaps the embedded zookeeper has trouble running in an app server? Any tips appreciated! Darren On 01/30/2012 06:58 PM, Darren Govoni wrote: Hi, Is there any issue with running the new SolrCloud deployed as a war in another app server? Has anyone tried this yet? thanks.
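For anyone repeating this setup, the properties listed above translate into JVM system properties on each app server domain. A sketch (paths, ports, and the ZooKeeper address are assumptions, not from the thread):

```
# JVM system properties for domain 1 (runs the embedded ZooKeeper via zkRun):
-Dsolr.solr.home=/path/to/solr1
-Dbootstrap_confdir=/path/to/solr1/conf
-Dcollection.configName=myconf
-Djetty.port=8080
-DzkRun

# JVM system properties for domain 2 (points at domain 1's ZooKeeper;
# the embedded ZooKeeper listens on the Solr port + 1000):
-Dsolr.solr.home=/path/to/solr2
-Dbootstrap_confdir=/path/to/solr2/conf
-Dcollection.configName=myconf
-Djetty.port=8181
-DzkHost=localhost:9080
```

jetty.port must match each domain's actual HTTP listener port, since (as Mark notes above) Solr derives hostPort from jetty.port unless -DhostPort= is passed explicitly.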
Re: Federation in SolrCloud?
Thanks for the reply Mark. I did example A. One of the instances had zookeeper. If I shut down the other instance, all searches on the other (running) instance produced an error in the browser. I don't have the error handy but it was one line. Something like missing shard in collection IIRC. What I'm hoping to achieve is this. Shard A: DocA, DocB Shard B: DocC, DocD if I do a query with both shards running I get DocA,DocB,DocC,DocD. If Shard B goes down, I only get DocA, DocB. After that I will fold replication in to understand it. On 02/02/2012 04:22 PM, Mark Miller wrote: On Feb 2, 2012, at 9:51 AM, dar...@ontrenet.com wrote: Hi, I want to use SolrCloud in a more federated mode rather than replication. The failover is nice, but I am more interested in increasing capacity of an index through horizontal scaling (shards). How can I configure shards such that they retain their own documents and don't replicate (or replicate to some shards and not all)? Thus, when I search from any shard I want results from all shards (being different results from each). Currently, if I kill a shard (using the example provided), no search works and it errors out. thanks! What example are you trying? Are you following it exactly? In order to serve requests at least one instance has to be up for every shard - but what you describe is how things work if you have enough replicas. Example A splits the index across two shards, but there are no replicas - if an instance goes down, search will not work. Example B and C add replicas. This means that one instance can die per shard and you will still be able to serve requests. Keep in mind that if you are running ZooKeeper with Solr (as the examples do), you have to make sure at least half the nodes running ZooKeeper are up. If that is only one node, you cannot kill that node - it will be a single point of failure unless you create a ZooKeeper ensemble. - Mark Miller lucidimagination.com
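The two-shard setup described here corresponds to an explicit shards= parameter on the query. A sketch of building such a request URL (host names hypothetical); note that with no replicas, the request fails if any listed shard is down, which matches the "missing shard" error seen above:

```python
from urllib.parse import urlencode

def distributed_query_url(base, shards, q, rows=10):
    # Every shard in the list must be reachable (or have a live
    # replica) for the request to succeed.
    params = {"q": q, "rows": rows, "shards": ",".join(shards)}
    return base + "?" + urlencode(params)

url = distributed_query_url(
    "http://hostA:8080/solr/select",
    ["hostA:8080/solr", "hostB:8080/solr"],  # shard A holds DocA/DocB, shard B holds DocC/DocD
    "*:*")
```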
Re: SolrCloud war?
Hi, I'm trying to get the SolrCloud2 examples to work using a war deployed solr into glassfish. The startup properties must be different in this case, because its having trouble connecting to zookeeper when I deploy the solr war file. Perhaps the embedded zookeeper has trouble running in an app server? Any tips appreciated! Darren On 01/30/2012 06:58 PM, Darren Govoni wrote: Hi, Is there any issue with running the new SolrCloud deployed as a war in another app server? Has anyone tried this yet? thanks.
SolrCloud war?
Hi, Is there any issue with running the new SolrCloud deployed as a war in another app server? Has anyone tried this yet? thanks.
Re: Hierarchical faceting in UI
Yuhao, Ok, let me think about this. A term can have multiple parents. Each of those parents would be 'different', yes? In this case, use a multivalued field for the parent and add all the parent names or id's to it. The relations should be unique. Your UI will associate the correct parent id to build the facet query from and return the correct children because the user is descending down a specific path in the UI and the parent node unique id's are returned along the way. Now, if you are having parent names/id's that themselves can appear in multiple locations (vs. just terms 'the leafs'), then perhaps your hierarchy needs refactoring for redundancy? Happy to help with more details. Darren On 01/24/2012 11:22 AM, Yuhao wrote: Darren, One challenge for me is that a term can appear in multiple places of the hierarchy. So it's not safe to simply use the term as it appears to get its children; I probably need to include the entire tree path up to this term. For example, if the hierarchy is Cardiovascular Diseases Arteriosclerosis Coronary Artery Disease, and I'm getting the children of the middle term Arteriosclerosi, I need to filter on something like parent:Cardiovascular Diseases/Arteriosclerosis. I'm having trouble figuring out how I can get the complete path per above to add to the URL of each facet term. I know velocity/facet_field.vm is where I build the URL. I know how to simply add a parent:term filter to the URL. But I don't know how to access a document field, like the complete parent path, in facet_field.vm. Any help would be great. Yuhao From: dar...@ontrenet.comdar...@ontrenet.com To: Yuhaonfsvi...@yahoo.com Cc: solr-user@lucene.apache.org Sent: Monday, January 23, 2012 7:16 PM Subject: Re: Hierarchical faceting in UI On Mon, 23 Jan 2012 14:33:00 -0800 (PST), Yuhaonfsvi...@yahoo.com wrote: Programmatically, something like this might work: for each facet field, add another hidden field that identifies its parent. 
Then, program additional logic in the UI to show only the facet terms at the currently selected level. For example, if one filters on cat:electronics, the new UI logic would apply the additional filter cat_parent:electronics. Can this be done? Yes. This is how I do it. Would it be a lot of work? No. Its not a lot of work, simply represent your hierarchy as parent/child relations in the document fields and in your UI drill down by issuing new faceted searches. Use the current facet (tree level) as the parent:level in the next query. Its much easier than other suggestions for this. Is there a better way? Not in my opinion, there isn't. This is the simplest to implement and understand. By the way, Flamenco (another faceted browser) has built-in support for hierarchies, and it has worked well for my data in this aspect (but less well than Solr in others). I'm looking for the same kind of hierarchical UI feature in Solr.
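The parent-path scheme discussed in this thread can be sketched at indexing time: for each term, also store the full path of its ancestors, so a term that appears in several branches of the hierarchy stays unambiguous. A minimal illustration; the field names are made up:

```python
def parent_paths(tree_path):
    # Split a document's full category path into one entry per level,
    # each carrying the complete ancestor path ("" for a root term).
    parts = tree_path.split("/")
    entries = []
    for depth, term in enumerate(parts):
        entries.append({"term": term, "parent": "/".join(parts[:depth])})
    return entries

path = "Cardiovascular Diseases/Arteriosclerosis/Coronary Artery Disease"
entries = parent_paths(path)

# The filter the UI would issue when descending into Arteriosclerosis:
fq = 'parent:"%s"' % entries[2]["parent"]
```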
Re: How to accelerate your Solr-Lucene application by 4x
I think the occassional Hey, we made something cool you might be interested in! notice, even if commercial, is ok because it addresses numerous issues we struggle with on this list. Now, if it were something completely off-base or unrelated (e.g. male enhancement pills), then yeah, I agree. On 01/18/2012 11:08 PM, Steven A Rowe wrote: Hi Darren, I think it's rare because it's rare: if this were found to be a useful advertising space, rare would cease to be descriptive of it. But I could be wrong. Steve -Original Message- From: Darren Govoni [mailto:dar...@ontrenet.com] Sent: Wednesday, January 18, 2012 8:40 PM To: solr-user@lucene.apache.org Subject: Re: How to accelerate your Solr-Lucene appication by 4x And to be honest, many people on this list are professionals who not only build their own solutions, but also buy tools and tech. I don't see what the big deal is if some clever company has something of imminent value here to share it. Considering that its a rare event. On 01/18/2012 08:28 PM, Jason Rutherglen wrote: Steven, If you are going to admonish people for advertising, it should be equally dished out or not at all. On Wed, Jan 18, 2012 at 6:38 PM, Steven A Rowesar...@syr.edu wrote: Hi Peter, Commercial solicitations are taboo here, except in the context of a request for help that is directly relevant to a product or service. Please don’t do this again. Steve Rowe From: Peter Velikin [mailto:pe...@velobit.com] Sent: Wednesday, January 18, 2012 6:33 PM To: solr-user@lucene.apache.org Subject: How to accelerate your Solr-Lucene appication by 4x Hello Solr users, Did you know that you can boost the performance of your Solr application using your existing servers? All you need is commodity SSD and plug-and-play software like VeloBit. At ZoomInfo, a leading business information provider, VeloBit increased the performance of the Solr-Lucene-powered application by 4x. 
I would love to tell you more about VeloBit and find out if we can deliver the same business benefits at your company. Click here (http://www.velobit.com/15-minute-brief) for a 15-minute briefing on the VeloBit technology. Here is more information on how VeloBit helped ZoomInfo: * Increased Solr-Lucene performance by 4x using existing servers and commodity SSD * Installed VeloBit plug-and-play SSD caching software in 5 minutes, transparent to running applications and storage infrastructure * Reduced by 75% the hardware and monthly operating costs required to support service level agreements Technical Details: * Environment: Solr-Lucene indexed directory search service fronted by J2EE web application technology * Index size: 600 GB * Number of items indexed: 50 million * Primary storage: 6 x SAS HDD * SSD Cache: VeloBit software + OCZ Vertex 3 Click here (http://www.velobit.com/use-cases/enterprise-search/) to read more about the ZoomInfo Solr-Lucene case study. You can also sign up (http://www.velobit.com/early-access-program-accelerate-application) for our Early Access Program and try VeloBit HyperCache for free. Also, feel free to write to me directly at pe...@velobit.com. Best regards, Peter Velikin VP Online Marketing, VeloBit, Inc. pe...@velobit.com tel. 978-263-4800 mob. 617-306-7165 VeloBit provides plug-and-play SSD caching software that dramatically accelerates applications at a remarkably low cost. The software installs seamlessly in less than 10 minutes and automatically tunes for fastest application speed. Visit www.velobit.com for details.
Re: How to accelerate your Solr-Lucene application by 4x
Agree. There's probably some unwritten etiquette there. On 01/19/2012 05:52 AM, Patrick Plaatje wrote: Partially agree. If just the facts are given, and not a complete sales talk instead, it'll be fine. Don't overdo it like this though. Cheers, Patrick 2012/1/19 Darren Govonidar...@ontrenet.com I think the occassional Hey, we made something cool you might be interested in! notice, even if commercial, is ok because it addresses numerous issues we struggle with on this list. Now, if it were something completely off-base or unrelated (e.g. male enhancement pills), then yeah, I agree. On 01/18/2012 11:08 PM, Steven A Rowe wrote: Hi Darren, I think it's rare because it's rare: if this were found to be a useful advertising space, rare would cease to be descriptive of it. But I could be wrong. Steve -Original Message- From: Darren Govoni [mailto:dar...@ontrenet.com] Sent: Wednesday, January 18, 2012 8:40 PM To: solr-user@lucene.apache.org Subject: Re: How to accelerate your Solr-Lucene appication by 4x And to be honest, many people on this list are professionals who not only build their own solutions, but also buy tools and tech. I don't see what the big deal is if some clever company has something of imminent value here to share it. Considering that its a rare event. On 01/18/2012 08:28 PM, Jason Rutherglen wrote: Steven, If you are going to admonish people for advertising, it should be equally dished out or not at all. On Wed, Jan 18, 2012 at 6:38 PM, Steven A Rowesar...@syr.eduwrote: Hi Peter, Commercial solicitations are taboo here, except in the context of a request for help that is directly relevant to a product or service. Please don’t do this again. Steve Rowe From: Peter Velikin [mailto:pe...@velobit.com] Sent: Wednesday, January 18, 2012 6:33 PM To: solr-user@lucene.apache.org Subject: How to accelerate your Solr-Lucene appication by 4x Hello Solr users, Did you know that you can boost the performance of your Solr application using your existing servers? 
Re: How can a distributed Solr setup scale to TB-data, if URL limitations are 4000 for distributed shard search?
Try changing the URI/HTTP/GET size limitation on your app server. On 01/18/2012 05:59 PM, Daniel Bruegge wrote: Hi, I am just wondering how I can 'grow' a distributed Solr setup to an index size of a couple of terabytes, when one of the distributed Solr limitations is max. 4000 characters in URI limitation. See: *The number of shards is limited by number of characters allowed for GET method's URI; most Web servers generally support at least 4000 characters, but many servers limit URI length to reduce their vulnerability to Denial of Service (DoS) attacks. * *(via http://lucidworks.lucidimagination.com/display/solr/Distributed+Search+with+Index+Sharding )* Is the only way then to make multiple distributed solr clusters and query them independently and merge them in application code? Thanks. Daniel
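Besides raising the server-side limit, a client can guard against it. A sketch of checking the assembled URI against a 4000-character budget and falling back to an HTTP POST of the parameters instead; the limit, hosts, and helper are illustrative only:

```python
MAX_URI = 4000  # typical default GET limit cited in the thread

def choose_method(base_url, query_string):
    # Long shards= lists can push the GET URI past the server's
    # limit; POSTing the same parameters avoids that cap.
    uri = base_url + "?" + query_string
    return "POST" if len(uri) > MAX_URI else "GET"

many_shards = ",".join(f"host{i}:8080/solr" for i in range(300))
method = choose_method("http://host:8080/solr/select",
                       "q=*:*&shards=" + many_shards)
```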
Re: How to accelerate your Solr-Lucene application by 4x
And to be honest, many people on this list are professionals who not only build their own solutions, but also buy tools and tech. I don't see what the big deal is if some clever company has something of imminent value here to share it. Considering that its a rare event. On 01/18/2012 08:28 PM, Jason Rutherglen wrote: Steven, If you are going to admonish people for advertising, it should be equally dished out or not at all. On Wed, Jan 18, 2012 at 6:38 PM, Steven A Rowesar...@syr.edu wrote: Hi Peter, Commercial solicitations are taboo here, except in the context of a request for help that is directly relevant to a product or service. Please don’t do this again. Steve Rowe From: Peter Velikin [mailto:pe...@velobit.com] Sent: Wednesday, January 18, 2012 6:33 PM To: solr-user@lucene.apache.org Subject: How to accelerate your Solr-Lucene appication by 4x Hello Solr users, Did you know that you can boost the performance of your Solr application using your existing servers? All you need is commodity SSD and plug-and-play software like VeloBit. At ZoomInfo, a leading business information provider, VeloBit increased the performance of the Solr-Lucene-powered application by 4x. I would love to tell you more about VeloBit and find out if we can deliver same business benefits at your company. Click herehttp://www.velobit.com/15-minute-brief for a 15-minute briefinghttp://www.velobit.com/15-minute-brief on the VeloBit technology. 
Highlighting in 3.5?
Hi, Can someone tell me if this is correct behavior from Solr. I search on a dynamic field: field_t:[* TO *] I set highlight fields to field_t,text_t but I am not searching specifically inside text_t field. The highlights for text_t come back with EVERY WORD. Maybe because of the [* TO *], but the query semantics indicate not searching on text_t even though highlighting is enabled. Is this correct behavior? it produces unwanted highlight results. I would expect Solr to know what fields are participating in the query and only highlight those that are involved in the result set. Thanks, Darren
Re: Highlighting in 3.5?
Hi Juan, Setting that parameter produces the same extraneous results. Here is my query: {!lucene q.op=OR df=text_t} kind_s:doc AND (( field_t:[* TO *] )) Clearly, the default field (text_t) is not being searched by this query and highlighting it would be semantically incongruent with the query. Is it a bug? Darren On 01/02/2012 04:39 PM, Juan Grande wrote: Hi Darren, This is the expected behavior. Have you tried setting the hl.requireFieldMatch parameter to true? See: http://wiki.apache.org/solr/HighlightingParameters#hl.requireFieldMatch *Juan* On Mon, Jan 2, 2012 at 10:54 AM, Darren Govonidar...@ontrenet.com wrote: Hi, Can someone tell me if this is correct behavior from Solr. I search on a dynamic field: field_t:[* TO *] I set highlight fields to field_t,text_t but I am not searching specifically inside text_t field. The highlights for text_t come back with EVERY WORD. Maybe because of the [* TO *], but the query semantics indicate not searching on text_t even though highlighting is enabled. Is this correct behavior? it produces unwanted highlight results. I would expect Solr to know what fields are participating in the query and only highlight those that are involved in the result set. Thanks, Darren
Re: Highlighting in 3.5?
Forgot to add, that the time when I DO want the highlight to appear would be with a query that DOES match the default field. {!lucene q.op=OR df=text_t} kind_s:doc AND (( field_t:[* TO *] )) cars Where the term 'cars' would be matched against the df. Then I want the highlight for it. If there are no query term matches for the df, then getting ALL the field terms highlighted (as it does now) is rather perplexing feature. Darren On 01/02/2012 06:28 PM, Darren Govoni wrote: Hi Juan, Setting that parameter produces the same extraneous results. Here is my query: {!lucene q.op=OR df=text_t} kind_s:doc AND (( field_t:[* TO *] )) Clearly, the default field (text_t) is not being searched by this query and highlighting it would be semantically incongruent with the query. Is it a bug? Darren On 01/02/2012 04:39 PM, Juan Grande wrote: Hi Darren, This is the expected behavior. Have you tried setting the hl.requireFieldMatch parameter to true? See: http://wiki.apache.org/solr/HighlightingParameters#hl.requireFieldMatch *Juan* On Mon, Jan 2, 2012 at 10:54 AM, Darren Govonidar...@ontrenet.com wrote: Hi, Can someone tell me if this is correct behavior from Solr. I search on a dynamic field: field_t:[* TO *] I set highlight fields to field_t,text_t but I am not searching specifically inside text_t field. The highlights for text_t come back with EVERY WORD. Maybe because of the [* TO *], but the query semantics indicate not searching on text_t even though highlighting is enabled. Is this correct behavior? it produces unwanted highlight results. I would expect Solr to know what fields are participating in the query and only highlight those that are involved in the result set. Thanks, Darren
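Until the server-side behavior is resolved, the extraneous snippets can be dropped client-side. A sketch that keeps highlights only for fields that actually took part in the query; the response shape follows Solr's highlighting section, {doc_id: {field: [snippets]}}:

```python
def relevant_highlights(highlighting, queried_fields):
    # Filter the highlighting section down to fields the query
    # actually searched, discarding e.g. a fully highlighted
    # default field that no query term matched.
    return {
        doc_id: {f: snips for f, snips in fields.items()
                 if f in queried_fields}
        for doc_id, fields in highlighting.items()
    }

resp = {"doc1": {"field_t": ["<em>x</em>"], "text_t": ["every", "word"]}}
clean = relevant_highlights(resp, {"field_t"})
```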
Re: Poor performance on distributed search
I see what you are asking. This is an interesting question. It seems inefficient for Solr to apply the requested rows to all shards only to discard most of the results on merge. That would consume lots of resources not used in the final result set. On 12/19/2011 04:32 PM, ku3ia wrote: Uhm, either I misunderstand your question or you're doing a lot of extra work for nothing The whole point of sharding it exactly to collect the top N docs from each shard and merge them into a single result. So if you want 10 docs, just specify rows=10. Solr will query all the shards, get the top 10 docs from each and then merge them into a final list 10 items long. Both the initial fetch and the final merge are based on the sort criteria are respected. Score is the default sort. If you specify other sort criteria, i.e. a field, then that sort is respected by the merge process. So why do you have this 2,000 requirement in the first place? This really sounds like an XY problem. As I wrote it is a minimum for me. I can't change it. Final response must has top 2K docs from all shards by query, so I specify rows=2000. Yeah, it collects top N docs from each shard. In my case N=2000, so on production I have 2000x30=60K, and on my own machine 2000x4=8K docs. Its true, this is an extra work, but in other case, seems it's only way to get top 2K docs from all shards, am I right? P.S. Is any mechanism, for example, to get top 100 rows from each shard, only merge it, sort by defined at query filed or score and pull result to the user? Uhm, either I misunderstand your question For example I have 4 shards. Finally, I need 2000 docs. Now, when I'm using shards=127.0.0.1:8080/solr/shard1,127.0.0.1:8080/solr/shard2,127.0.0.1:8080/solr/shard3,127.0.0.1:8080/solr/shard4 Solr gets 2000 docs from each shard (shard1,2,3,4, summary we have 8000 docs) merge and sort it, for example, by default field (score), and returns me only 2000 rows (not all 8000), which I specified at request. 
So, my question was about, is any mechanism in Solr, which gets not 2000 rows from each shard, and say, If I specified 2000 docs at request, Solr calculates how much shards I have (four shards), divides total rows onto shards (2000/4=500) and sends to each shard queries with rows=500, but not rows=2000, so finally, summary after merging and sorting I'll have 2000 rows (maybe less), but not 8000... That was my question. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Poor-performance-on-distributed-search-tp3590028p3599636.html Sent from the Solr - User mailing list archive at Nabble.com.
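The merge being described (and why asking each shard for rows/num_shards can return wrong results) can be sketched as:

```python
import heapq

def merge_shard_results(shard_results, rows):
    # Each shard has already returned its own top `rows` docs sorted
    # by descending score; the coordinator keeps the global top
    # `rows` and discards the rest. With S shards that is S*rows
    # docs fetched to return rows -- the overhead discussed above.
    merged = heapq.merge(*shard_results, key=lambda d: -d["score"])
    return [d["id"] for d in merged][:rows]

shard1 = [{"id": "a", "score": 9.0}, {"id": "b", "score": 8.0}]
shard2 = [{"id": "c", "score": 7.0}, {"id": "d", "score": 1.0}]

top = merge_shard_results([shard1, shard2], 2)

# Fetching only rows/num_shards (here 1) from each shard would
# instead yield ["a", "c"], missing "b": the global top docs need
# not be spread evenly across shards.
naive = merge_shard_results([shard1[:1], shard2[:1]], 2)
```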
Re: Grouping or Facet ?
Yes. That's what I would expect. I guess I didn't understand when you said The facet counts are the counts of the *values* in that field Because it seems its the count of the number of matching documents irrespective if one document has 20 values for that field and another 10, the facet count will be 2, one for each document in the results. On 12/07/2011 09:04 AM, Erick Erickson wrote: In your example you'll have 10 facets returned each with a value of 1. Best Erick On Tue, Dec 6, 2011 at 9:54 AM,dar...@ontrenet.com wrote: Sorry to jump into this thread, but are you saying that the facet count is not # of result hits? So if I have 1 document with field CAT that has 10 values and I do a query that returns this 1 document with faceting, that the CAT facet count will be 10 not 1? I don't seem to be seeing that behavior in my app (Solr 3.5). Thanks. OK, I'm not understanding here. You get the counts and the results if you facet on a single category field. The facet counts are the counts of the *values* in that field. So it would help me if you showed the output of faceting on a single category field and why that didn't work for you But either way, faceting will probably outperform grouping. Best Erick On Mon, Dec 5, 2011 at 9:05 AM, Juan Pablo Morajua...@informa.es wrote: Because I need the count and the result to return back to the client side. Both the grouping and the facet offers me a solution to do that, but my doubt is about performance ... With Grouping my results are: grouped:{ category:{ matches: ..., groups:[{ groupValue:categoryXX, doclist:{numFound:Important_number,start:0,docs:[ { doc:id category:XX } groupValue:categoryYY, doclist:{numFound:Important_number,start:0,docs:[ { doc: id category:YY } And with faceting my results are : facet.prefix=whatever facet_counts:{ facet_queries:{}, facet_fields:{ namesXX:[ whatever_name_in_category,76, ... namesYY:[ whatever_name_in_category,76, ... Both results are OK to me. 
De: Erick Erickson [erickerick...@gmail.com] Enviado el: lunes, 05 de diciembre de 2011 14:48 Para: solr-user@lucene.apache.org Asunto: Re: Grouping or Facet ? Why not just use the first form of the document and just facet.field=category? You'll get two different facet counts for XX and YY that way. I don't think grouping is the way to go here. Best Erick On Sat, Dec 3, 2011 at 6:43 AM, Juan Pablo Morajua...@informa.es wrote: I need to do some counts on a StrField field to suggest options from two different categories, and I don´t know what option is the best: My schema looks: - id - name - category: XX or YY with Grouping I do: http://localhost:8983/?q=name:prefix*group=truegroup.field=category But I can change my schema to to: - id - nameXX - nameYY - category: XX or YY (only 1 value in nameXX or nameYY) With facet: http://localhost:8983/?q=*:*facet=truefacet.field=nameXXfacet.field=nameYYfacet.prefix=prefix What option have the best performance ? Best, Juampa.
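The counting behavior settled on in this thread (a facet count is the number of matching documents containing the value, so a multivalued field contributes at most 1 per document per distinct value) can be sketched as:

```python
def facet_counts(docs, field):
    # One document with ten values for the field bumps ten different
    # value counts by one each -- not one count by ten.
    counts = {}
    for doc in docs:
        for value in set(doc.get(field, [])):
            counts[value] = counts.get(value, 0) + 1
    return counts

docs = [{"cat": ["a", "b"]}, {"cat": ["a"]}]
counts = facet_counts(docs, "cat")

# Erick's example: one doc, ten values -> ten facets, each with count 1.
one_doc = facet_counts([{"cat": [str(i) for i in range(10)]}], "cat")
```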
Re: Solr 3.5 very slow (performance)
Monitoring this thread make me ask the question of whether there are standardized performance benchmarks for Solr. Such that they are run and published with each new release. This would affirm its performance under known circumstances, with which people can try in their own environments and compare to their application behavior. I think it would be a good idea. On 11/30/2011 04:12 PM, Pawel Rog wrote: On Wed, Nov 30, 2011 at 9:05 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I tried to use index from 1.4 (load was the same as on index from 3.5) : but there was problem with synchronization with master (invalid : javabin format) : Then I built new index on 3.5 with luceneMatchVersion LUCENE_35 why would you need to re-replicate from the master? You already have a copy of the Solr 1.4 index on the slave machine where you are doing testing correct? Just (make sure Solr 1.4 isn't running and) point Solr 3.5 at that solr home directory for the configs and data and time that. (Just because Solr 3.5 can't replicate from Solr 1.4 over HTTP doesn't mean it can't open indexes built by Solr 1.4) I made It before sending earlier e-mail. Efect was the same. It's important to understand if the discrepencies you are seeing have to do with *building* the index under Solr 3.5, or *searching* in Solr 3.5. 
: reader : SolrIndexReader{this=8cca36c,r=ReadOnlyDirectoryReader@8cca36c,refCnt=1,segments=4} : readerDir : org.apache.lucene.store.NIOFSDirectory@/data/solr_data/itemsfull/index : : solr 3.5 : reader : SolrIndexReader{this=3d01e178,r=ReadOnlyDirectoryReader@3d01e178,refCnt=1,segments=14} : readerDir : org.apache.lucene.store.MMapDirectory@/data/solr_data_350/itemsfull/index : lockFactory=org.apache.lucene.store.NativeFSLockFactory@294ce5eb As mentioned, the difference in the number of segments may be contributing to the perf differences you are seeing, so optimizing both indexes (or doing a partial optimize of your 3.5 index down to 4 segments) for comparison would probably be worthwhile. (and if that is the entirety of hte problem, then explicitly configuring a MergePolicy may help you in the long run) but independent of that I would like to suggest that you first try explicitly configuring Solr 3.5 to use NIOFSDirectory so it's consistent with what Solr 1.4 was doing (I'm told MMapDirectory should be faster, but maybe there's something about your setup that makes that not true) So it would be helpful to also try adding this to your 3.5 solrconfig.xml and testing ... directoryFactory name=DirectoryFactory class=solr.NIOFSDirectoryFactory/ : I made some test with quiet heavy query (with frange). In both cases : (1.4 and 3.5) I used the same newSearcher queries and started solr : without any load. : Results of debug timing Ok, well ... honestly: giving us *one* example of the timing data for *one* query (w/o even telling us what the exact query was) ins't really anything we can use to help you ... the crux of the question was: was the slow performance you are seeing only under heavy load or was it also slow when you did manual testing? : When I send fewer than 60 rps I see that in comparsion to 1.4 median : response time is worse, avarage is worse but maximum time is better. : It doesn't change propotion of cpu usage (3.5 uses much more cpu). 
How much fewer then 60 rps ? ... I'm trying to understand if the problems you are seeing are solely happening under heavy concurrent load, or if you are seeing Solr 3.5 consistently respond much slower then Solr 1.4 even with a single client? Also: I may still be missunderstanding how you are generating load, and wether you are throttling the clients, but seeing higher CPU utilization in Solr 3.5 isn't neccessarily an indication of something going wrong -- in some cases higher CPU% (particularly under heavy concurrent load on a multi-core machine) could just mean that Solr is now capable of utilizing more CPU to process parallel request, where as previous versions might have been hitting other bottle necks. -- but that doesn't explain the slower response times. that's what concerns me the most. I don't think that 1200% CPU usage with the same traffic is better then 200%. I think you are wrong :) Using solr 1.4 I can reach 300rps and then reach 1200% on cpu and only 60rps in solr 3.5 FWIW: I'm still wondering what the stats from your caches wound up looking like on both Solr 1.4 and Solr 3.5... 7) What do the cache stats look like on your Solr 3.5 instance after you've done some of this timing testing? the output of... http://localhost:8983/solr/admin/mbeans?cat=CACHEstats=truewt=jsonindent=true ...would be helpful. NOTE: you may need to add this to your solrconfig.xml for that URL to work... requestHandler name=/admin/ class=solr.admin.AdminHandlers /' ...but i don't think /admin/mbeans exists in Solr 1.4, so you may just have to get the details from stats.jsp. I forgot to write it earlier. QueryCache hit rate was about 0.03 (in solr
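On the standardized-benchmark idea raised above, the comparison this thread keeps making (median, average, and max response time at a given request rate) is straightforward to make repeatable. A minimal sketch of summarizing a run's latency samples; the sample values are invented:

```python
import statistics

def latency_report(samples_ms):
    # Summarize response-time samples the way the thread compares
    # 1.4 vs 3.5: median, average, and worst case. Median and max
    # can diverge sharply when a few requests stall.
    return {
        "median": statistics.median(samples_ms),
        "avg": statistics.mean(samples_ms),
        "max": max(samples_ms),
    }

report = latency_report([12.0, 15.0, 11.0, 240.0])
```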
Re: Solr 3.5 very slow (performance)
Any suspicous activity in the logs? what about disk activity? On 11/29/2011 05:22 PM, Pawel Rog wrote: On Tue, Nov 29, 2011 at 9:13 PM, Chris Hostetter hossman_luc...@fucit.org wrote: Let's back up a minute and cover some basics... 1) You said that you built a brand new index on a brand new master server, using Solr 3.5 -- how do you build your indexes? did the source data change at all? does your new index have the same number of docs as your previous Solr 1.4 index? what does a directory listing (including file sizes) look like for both your old and new indexes? Yes, both indexes have same data. Indexes are build using some C++ programm which reads data from database and inserts it into Solr (using XML). Both indexes have about 8GB size and 18milions documents. 2) Did you try using your Solr 1.4 index (and configs) directly in Solr 3.5 w/o rebuilding from scratch? Yes I used the same configs in solr 1.4 and solr 3.5 (adding only line about luceneMatchVersion) As I see in example of solr 3.5 in repository (solrconfig.xml) there are not many diffrences. 3) You said you build the new index on a new mmachine, but then you said you used a slave where the performanne was worse then Solr 1.4 on the same machine ... are you running both the Solr 1.4 and Solr 3.5 instances concurrently on your slave machine? How much physical ram is on that machine? what JVM options are using when running the Solr 3.5 instance? what servlet container are you using? Mayby I didn't wrote precisely enough. I have some machine on which there is master node. I have second machine on which there is slave. I tested solr 1.4 on that machine, then turned it off and turned on solr-3.5. I have 36GB RAM on that machine. On both - solr 1.4 and 3.5 configuration of JVM is the same, and the same servlet container ... 
jetty-6 JVM options: -server -Xms12000m -Xmx12000m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:NewSize=1500m -XX:ParallelGCThreads=8 -XX:CMSInitiatingOccupancyFraction=60 4) what does your request handler configuration look like? do you have any default/invariant/appended request params?

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>
<requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- fully qualified url for the replication handler of master. It is possible to pass this as a request param for the fetchindex command -->
    <str name="masterUrl">http://${masterHost}:${masterPort}/solr-3.5/${solr.core.instanceDir}replication</str>
    <str name="pollInterval">00:00:02</str>
    <str name="httpConnTimeout">5000</str>
    <str name="httpReadTimeout">1</str>
  </lst>
</requestHandler>

5) The descriptions you've given of how the performance has changed sound like you are doing concurrent load testing -- did you do cache warming before you started your testing? how many client threads are hitting the solr server at one time? Maybe I wasn't precise enough again. CPU on Solr 1.4 was 200% and on Solr 3.5 it was 1200%. Yes, there is cache warming. There are 50-100 client threads on both 1.4 and 3.5. There are about 60 requests per second on 3.5 and on 1.4, but on 3.5 responses are slower and CPU usage is much higher. 6) have you tried doing some basic manual testing to see how individual requests perform? ie: single client at a time, loading a URL, then requesting the same URL again to verify that your Solr caches are in use and the QTime is low. If you see slow response times even when manually executing single requests at a time, have you tried using debug=timing to see which search components are contributing the most to the slow QTimes? 
Most time is spent in org.apache.solr.handler.component.QueryComponent and org.apache.solr.handler.component.DebugComponent in process. I didn't compare individual request performance. 7) What do the cache stats look like on your Solr 3.5 instance after you've done some of this timing testing? the output of... http://localhost:8983/solr/admin/mbeans?cat=CACHE&stats=true&wt=json&indent=true ...would be helpful. NOTE: you may need to add this to your solrconfig.xml for that URL to work... <requestHandler name="/admin/" class="solr.admin.AdminHandlers" /> Will check it :) : in my last post I meant : default operator AND : promoted - int : ending - int : b_count - int : name - text : cat1 - int : cat2 - int : : On Tue, Nov 29, 2011 at 7:54 PM, Pawel Rog pawelro...@gmail.com wrote: : examples : : facet=true&sort=promoted+desc,ending+asc,b_count+desc&facet.mincount=1&start=0&q=name:(kurtka+skóry+brazowe42)&facet.limit=500&facet.field=cat1&facet.field=cat2&wt=json&rows=50 : :
Query time help
Hi, I am running Solr 3.4 in its own Glassfish domain. I have about 12,500 documents with 100 or so fields with the works (stored, term vectors, etc). In my webtier code, I use SolrJ and execute a query as such:

long querystart = new Date().getTime();
System.out.println("BEFORE QUERY TIME: " + (querystart - startime) + " milliseconds.");
1. QueryResponse qr = solr.query(aquery, METHOD.POST);
long queryend = new Date().getTime();
System.out.println("QUERY TIME: " + (queryend - querystart) + " milliseconds. Before QUERY TIME: " + (querystart - startime));

The QTime in the response reads 50-77, but line 1. takes anywhere from 5-13 seconds to complete. Here is the query: {!lucene q.op=OR df=text_t} ( kind_s:doc OR kind_s:xml) AND (( item_sm_t:[* TO *] )) AND (usergroup_sm:admin) What could be causing this delay? The server has 15GB RAM. Responses are not unreasonably large. I use paging. Many thanks, Darren
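A likely place to look for a gap this large is the work QTime does not cover: QTime measures only the server-side search, while the wall-clock time around `solr.query(...)` also includes network transfer, response serialization, and client-side parsing (which can be substantial with 100 stored fields plus term vectors). A minimal standalone sketch of the measurement pattern, where `simulatedQuery` is a hypothetical stand-in for the real SolrJ call so the example runs without a server:

```java
// Standalone sketch: QTime covers only server-side search time; wall-clock
// time around the client call also includes network + parsing overhead.
// simulatedQuery() is a hypothetical stand-in for solr.query(aquery, METHOD.POST).
public class QueryTiming {
    static int simulatedQuery() throws InterruptedException {
        Thread.sleep(200);  // stand-in for network + serialization + search
        return 55;          // pretend this is the QTime reported by Solr
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        int qtime = simulatedQuery();
        long elapsed = System.currentTimeMillis() - start;
        // The difference is everything QTime does not account for.
        System.out.println("QTime=" + qtime + " clientElapsed=" + elapsed
                + " overheadMs=" + (elapsed - qtime));
    }
}
```

If clientElapsed dwarfs QTime on the real call, it is worth testing whether trimming the returned field list (fl=) shrinks the gap.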
Re: inconsistent results when faceting on multivalued field
My interpretation of your results is that your fq found 1281 documents with the value 1213206 in the sou_codeMetier field. Of those results, 476 also had 1212104 as a value... and so on. Since ALL the results will have the field value in your fq, I would expect the other values to occur equally or less often in the result set, which they appear to. On 10/21/2011 03:55 AM, Alain Rogister wrote: Pravesh, Not exactly. Here is the search I do, in more detail (different field name, but same issue). I want to get a count for a specific value of the sou_codeMetier field, which is multivalued. I expressed this by including an fq clause: /select/?q=*:*&facet=true&facet.field=sou_codeMetier&fq=sou_codeMetier:1213206&rows=0 The response (excerpt only):

<lst name="facet_fields">
  <lst name="sou_codeMetier">
    <int name="1213206">1281</int>
    <int name="1212104">476</int>
    <int name="121320603">285</int>
    <int name="1213101">260</int>
    <int name="121320602">208</int>
    <int name="121320605">171</int>
    <int name="1212201">152</int>
    ...

As you see, I get back both the expected results and extra results I would expect to be filtered out by the fq clause. I can eliminate the extra results with an 'f.sou_codeMetier.facet.prefix=1213206' clause. But I wonder if Solr's behavior is correct and how the fq filtering works exactly. If I replace the facet.field clause with a facet.query clause, like this: /select/?q=*:*&facet=true&facet.query=sou_codeMetier:[1213206 TO 1213206]&rows=0 The results contain a single item:

<lst name="facet_queries">
  <int name="sou_codeMetier:[1213206 TO 1213206]">1281</int>
</lst>

The 'fq=sou_codeMetier:1213206' clause isn't necessary here and does not affect the results. Thanks, Alain On Fri, Oct 21, 2011 at 9:18 AM, pravesh suyalprav...@yahoo.com wrote: Could you clarify the below: When I make a search on facet.qua_code=1234567 ?? Are you trying to say that you fire a fresh search for a facet item, like q=qua_code:1234567?? 
This would fetch documents where the qua_code field contains either the term 1234567 alone OR both terms (1234567 AND 9384738, among other terms). This is because it is a multivalued field, and hence if you look at the facets, they are shown for both terms. If I reword the query as 'facet.query=qua_code:[1234567 TO 1234567]', I only get the expected counts. You will get facets for documents which have the term 1234567 only (facet.query applies to the facets, i.e. to which facets are picked/shown). Regds Pravesh -- View this message in context: http://lucene.472066.n3.nabble.com/inconsistent-results-when-faceting-on-multivalued-field-tp3438991p3440128.html Sent from the Solr - User mailing list archive at Nabble.com.
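The behavior Pravesh describes — fq narrowing the *document* set while facet.field then counts every value those documents carry — can be sketched standalone. The values and counts below are illustrative, not Alain's real data:

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch of why fq=sou_codeMetier:1213206 does not prune the facet list:
// fq restricts which documents match, but facet.field then counts every
// value of the multivalued field across the surviving documents.
public class FacetSketch {
    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("1213206", "1212104"),   // matches the filter
            Arrays.asList("1213206"),              // matches the filter
            Arrays.asList("1212104"));             // filtered out by fq
        // fq: keep only documents containing the filtered value
        List<List<String>> filtered = docs.stream()
            .filter(d -> d.contains("1213206"))
            .collect(Collectors.toList());
        // facet.field: count every value present in the surviving documents
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> d : filtered)
            for (String v : d)
                counts.merge(v, 1, Integer::sum);
        System.out.println(counts); // → {1212104=1, 1213206=2}
    }
}
```

The other values of matching documents still appear, with counts less than or equal to the filtered value's count — exactly the pattern in Alain's response excerpt.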
Re: Merging Remote Solr Indexes?
Interesting, Yury. Thanks. On 10/20/2011 11:00 AM, Yury Kats wrote: On 10/19/2011 5:15 PM, Darren Govoni wrote: Hi Otis, Yeah, I saw that page, but it says it's for merging cores, which I presume must reside locally to the Solr instance doing the merging? What I'm interested in doing is merging across Solr instances running on different machines into a single Solr running on another machine (programmatically). Is it still possible, or did I misread the wiki? Possible, but in a few steps. 1. Create new cores on the target machine. 2. Replicate to them from the different source machines. 3. Merge on the target machine. All 3 steps can be done programmatically.
Re: Merging Remote Solr Indexes?
Hi Otis, Yeah, I saw that page, but it says it's for merging cores, which I presume must reside locally to the Solr instance doing the merging? What I'm interested in doing is merging across Solr instances running on different machines into a single Solr running on another machine (programmatically). Is it still possible, or did I misread the wiki? Thanks! Darren On 10/19/2011 11:57 AM, Otis Gospodnetic wrote: Hi Darren, http://search-lucene.com/?q=solr+merge&fc_project=Solr Check hit #1 Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: dar...@ontrenet.com dar...@ontrenet.com To: solr-user@lucene.apache.org Sent: Wednesday, October 19, 2011 10:04 AM Subject: Merging Remote Solr Indexes? Hi, I thought of a useful capability if it doesn't already exist. Is it possible to do an index merge between two remote Solrs? To handle massive index-time scalability, wouldn't it be useful to have distributed indexes accepting local input, then merge them into one central index after? Darren
Re: Merging Remote Solr Indexes?
Actually, yeah. If you think about it, a remote merge is like the inverse of replication. Where replication is one-to-many away from an index, the inverse would be merging many back into one. Sorta like a recall. I think it would be a great analog to replication. On 10/19/2011 06:18 PM, Otis Gospodnetic wrote: Darren, No, that is not possible without copying an index/shard to a single machine on which you would then merge indices as described on the Wiki. Hmm, wouldn't it be nice to make use of existing replication code to make it possible to move shards around the cluster? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: Darren Govoni dar...@ontrenet.com To: solr-user@lucene.apache.org Sent: Wednesday, October 19, 2011 5:15 PM Subject: Re: Merging Remote Solr Indexes? Hi Otis, Yeah, I saw that page, but it says it's for merging cores, which I presume must reside locally to the Solr instance doing the merging? What I'm interested in doing is merging across Solr instances running on different machines into a single Solr running on another machine (programmatically). Is it still possible, or did I misread the wiki? Thanks! Darren On 10/19/2011 11:57 AM, Otis Gospodnetic wrote: Hi Darren, http://search-lucene.com/?q=solr+merge&fc_project=Solr Check hit #1 Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: dar...@ontrenet.com dar...@ontrenet.com To: solr-user@lucene.apache.org Sent: Wednesday, October 19, 2011 10:04 AM Subject: Merging Remote Solr Indexes? Hi, I thought of a useful capability if it doesn't already exist. Is it possible to do an index merge between two remote Solrs? To handle massive index-time scalability, wouldn't it be useful to have distributed indexes accepting local input, then merge them into one central index after? Darren
Re: basic solr cloud questions
That was kinda my point. The new cloud implementation is not about replication, nor should it be, but rather about horizontal scalability where nodes manage different parts of a unified index. One of the design goals of the new cloud implementation is for this to happen more or less automatically. To me that means one does not have to manually distribute documents or enforce replication as Yury suggests. Replication is different to me than what was being asked. And perhaps I misunderstood the original question. Yury's response introduced the term core where the original person was referring to nodes. For all I know, those are two different things in the new cloud design terminology (I believe they are). I guess understanding cores vs. nodes vs. shards is helpful. :) cheers! Darren On 09/29/2011 12:00 AM, Pulkit Singhal wrote: @Darren: I feel that the question itself is misleading. Creating shards is meant to separate out the data ... not keep the exact same copy of it. I think the two-node setup that was attempted by Sam misled him and us into thinking that configuring two nodes which are to be named shard1 ... somehow means that they are instantly replicated too ... this is not the case! I can see how this misunderstanding can develop, as I too was confused until Yury cleared it up. @Sam: If you are interested in performing a quick exercise to understand the pieces involved for replication rather than sharding ... perhaps this link would be of help in taking you through it: http://pulkitsinghal.blogspot.com/2011/09/setup-solr-master-slave-replication.html - Pulkit 2011/9/27 Yury Kats yuryk...@yahoo.com: On 9/27/2011 5:16 PM, Darren Govoni wrote: On 09/27/2011 05:05 PM, Yury Kats wrote: You need to either submit the docs to both nodes, or have a replication setup between the two. Otherwise they are not in sync. I hope that's not the case. 
:/ My understanding (or hope maybe) is that the new Solr Cloud implementation will support auto-sharding and distributed indexing. This means that shards will receive different documents regardless of which node received the submitted document (spread evenly based on a hash-node assignment). Distributed queries will thus merge all the solr shard/node responses. All cores in the same shard must somehow have the same index. Only then can you continue servicing searches when individual cores fail. Auto-sharding and distributed indexing don't have anything to do with this. In the future, SolrCloud may be managing replication between cores in the same shard automatically. But right now it does not.
Re: basic solr cloud questions
Agree. Thanks also for clarifying. It helps. On 09/29/2011 08:50 AM, Yury Kats wrote: On 9/29/2011 7:22 AM, Darren Govoni wrote: That was kinda my point. The new cloud implementation is not about replication, nor should it be. But rather about horizontal scalability where nodes manage different parts of a unified index. It's about many things. You stated one, but there are other goals, one of them being tolerance to node outages. In a cloud, when one of your many nodes fails, you don't want to stop querying and indexing. For this to happen, you need to maintain redundant copies of the same pieces of the index, hence you need to replicate. One of the design goals of the new cloud implementation is for this to happen more or less automatically. True, but there is a big gap between goals and current state. Right now, there is distributed search, but not distributed indexing or auto-sharding, or auto-replication. So if you want to use SolrCloud now (as many of us do), you need to do a number of things yourself, even if they might be done by SolrCloud automatically in the future. To me that means one does not have to manually distribute documents or enforce replication as Yury suggests. Replication is different to me than what was being asked. And perhaps I misunderstood the original question. Yury's response introduced the term core where the original person was referring to nodes. For all I know, those are two different things in the new cloud design terminology (I believe they are). I guess understanding cores vs. nodes vs. shards is helpful. :) A shard is a slice of the index. An index is managed/stored in a core. Nodes are Solr instances, usually physical machines. Each node can host multiple shards, and each shard can consist of multiple cores. However, all cores within the same shard must have the same content. This is where the OP ran into the problem. The OP had 1 shard, consisting of two cores on two nodes. 
Since there is no distributed indexing yet, all documents were indexed into a single core. However, there is distributed search, therefore queries were sent randomly to different cores of the same shard. Since one core in the shard had documents and the other didn't, the query result was random. To solve this problem, the OP must make sure all cores within the same shard (be they on the same node or not) have the same content. This can currently be achieved by: a) setting up replication between cores. you index into one core and the other core replicates the content b) indexing into both cores Hope this clarifies.
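For option (a), a minimal sketch of what the replication wiring could look like in each core's solrconfig.xml. The host name, port, core name, and poll interval here are placeholders for illustration, not values taken from the thread:

```xml
<!-- On the core you index into (the "master" of the pair) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
  </lst>
</requestHandler>

<!-- On the other core in the same shard (the "slave" of the pair) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/core1/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```

With this in place, indexing goes to the master core only, and the slave core pulls the index after each commit, keeping both cores in the shard consistent.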
Re: basic solr cloud questions
On 09/27/2011 05:05 PM, Yury Kats wrote: You need to either submit the docs to both nodes, or have a replication setup between the two. Otherwise they are not in sync. I hope that's not the case. :/ My understanding (or hope, maybe) is that the new Solr Cloud implementation will support auto-sharding and distributed indexing. This means that shards will receive different documents regardless of which node received the submitted document (spread evenly based on a hash-node assignment). Distributed queries will thus merge all the Solr shard/node responses. This is similar in theory to how memcache and other big-scale DHTs work. If it's just manually replicated indexes then it's not really a step forward from current Solr. :/
Re: Geo spatial search with multi-valued locations (SOLR-2155 / lucene-spatial-playground)
It doesn't. On 08/29/2011 01:37 PM, Mike Austin wrote: I've been trying to follow the progress of this and I'm not sure what the current status is. Can someone update me on what is currently in Solr4 and does it support multi-valued location in a single document? I saw that SOLR-2155 was not included and is now lucene-spatial-playground. Thanks, Mike
Paging over mutlivalued field results?
Hi, Is it possible to construct a query in Solr where the paged results are matching multivalued fields and not documents? thanks, Darren
Re: Paging over mutlivalued field results?
Hi Erick, Sure thing. I have a document schema where I put the sentences of that document in a multivalued field 'sentences'. I search that field in a query but get back the document results, naturally. I then need to further find which exact sentences matched the query (for each document result) and then do my own paging, since I am only returning pages of sentences and not whole documents (i.e. I don't want to page the document results). Does this make sense? Or is there a better way Solr can accommodate this? Much appreciated. Darren On 08/25/2011 07:24 PM, Erick Erickson wrote: Hmm, I don't quite understand what you want. An example or two would help. Best Erick On Thu, Aug 25, 2011 at 12:11 PM, Darren Govoni dar...@ontrenet.com wrote: Hi, Is it possible to construct a query in Solr where the paged results are matching multivalued fields and not documents? thanks, Darren
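The client-side workflow described here — collect the matching sentences from each returned document, then page over the flattened sentence list — can be sketched standalone. The matching step is reduced to a simple substring test for illustration; in practice you would need the real query semantics (e.g. Solr highlighting) to decide which sentence values matched:

```java
import java.util.*;

// Sketch of client-side sentence paging: flatten the sentences that match
// the query across all document results, then slice out one page.
public class SentencePaging {
    static List<String> pageOfMatches(List<List<String>> docSentences,
                                      String term, int page, int pageSize) {
        List<String> matches = new ArrayList<>();
        for (List<String> sentences : docSentences)
            for (String s : sentences)
                if (s.contains(term))     // stand-in for real query matching
                    matches.add(s);
        int from = Math.min(page * pageSize, matches.size());
        int to = Math.min(from + pageSize, matches.size());
        return matches.subList(from, to);
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("solr is fast", "lucene inside"),
            Arrays.asList("about solr paging", "unrelated line"));
        System.out.println(pageOfMatches(docs, "solr", 0, 1)); // first page
        System.out.println(pageOfMatches(docs, "solr", 1, 1)); // second page
    }
}
```

With Solr itself, the highlighting parameters (hl=true&hl.fl=sentences) are the usual way to recover which field values matched, which would replace the substring test above.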