solrcloud -How to delete a doc at a specific shard
My SolrCloud has 3 shards and 2 replicas, and one shard's docs are duplicated. The document router is compositeId. Who can help me? -- View this message in context: http://lucene.472066.n3.nabble.com/solrcloud-How-to-delete-a-doc-at-a-specific-shard-tp4249354.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOLR 5.4.0?
Sorry for taking so long. I can confirm that SOLR-8418 is fixed for me in a self-built 5.5.0 snapshot. Now the next obvious question is, any ETA for a release? Regards, Ere 31.12.2015, 19.15, Erick Erickson wrote: Ere: Can you help with testing the patch if it's important to you? Ramkumar is working on it... Best, Erick On Wed, Dec 30, 2015 at 11:07 PM, Ere Maijala wrote: Well, for us SOLR-8418 is a major issue. I haven't encountered other issues, but that one was sort of a show-stopper. --Ere 31.12.2015, 7.27, William Bell wrote: How is SOLR 5.4.0? I heard there was a quick 5.4.1 coming out? Any major issues? -- Ere Maijala Kansalliskirjasto / The National Library of Finland
Re: date difference faceting
For anyone wanting to know an answer, I used facet.query={!frange l=0 u=3110400}ms(d_b,d_a) facet.query={!frange l=3110401 u=6220800}ms(d_b,d_a) facet.query={!frange l=6220801 u=15552000}ms(d_b,d_a) etc ... Not the prettiest nor most efficient, but it accomplishes what I need without re-indexing TBs of data. Thanks. On 01/08/2016 12:09 PM, Erick Erickson wrote: I'm going to side-step your primary question and say that it's nearly always best to do your calculations up-front during indexing to make queries more efficient and thus serve more requests on the same hardware. This assumes that the stat you're interested in is predictable, of course... Best, Erick On Fri, Jan 8, 2016 at 2:23 AM, David Santamauro wrote: Hi, I have two date fields, d_a and d_b, both of type solr.TrieDateField, that represent different events associated with a particular document. The interval between these dates is relevant for corner-case statistics. The interval is calculated as the difference sub(d_b,d_a), and I've been able to get stats on it with stats=true&stats.field={!func}sub(d_b,d_a) What I ultimately would like to report is the interval represented as a range, which could be seen as facet.query (pseudo code) facet.query=sub(d_b,d_a)[ * TO 8640 ] // day facet.query=sub(d_b,d_a)[ 8641 TO 60480 ] // week facet.query=sub(d_b,d_a)[ 60481 TO 259200 ] // month etc. Aside from actually indexing the difference in a separate field, is there something obvious I'm missing? I'm on SOLR 5.2 in cloud mode. thanks David
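The facet.query parameters above can be generated rather than typed by hand. A minimal sketch (the bucket bounds are the ones posted in this thread, and ms(d_b,d_a) is the millisecond difference between the two date fields):

```python
from urllib.parse import urlencode

# Millisecond bounds for each bucket, taken from the thread; adjust to
# whatever intervals (day/week/month) you actually need.
buckets = [(0, 3110400), (3110401, 6220800), (6220801, 15552000)]

# One facet.query per bucket: a {!frange} over ms(d_b,d_a), so each bucket
# counts docs whose date difference falls inside [l, u].
params = [("facet", "true")] + [
    ("facet.query", "{!frange l=%d u=%d}ms(d_b,d_a)" % (l, u))
    for l, u in buckets
]
query_string = urlencode(params)
print(query_string)
```

The resulting string is appended to the normal select URL; each frange facet is evaluated independently, which is why this avoids re-indexing.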
Re: SOLR replicas performance
Hi Luca, It looks like your queries are complex wildcard queries. My theory is that you are CPU-bound; for a single query, one CPU core for each shard will be at 100% for the duration of the sub-query. Smaller shards make these sub-queries faster, which is why 16 shards is better than 8 in your case. * In your 16x1 configuration, you have exactly one shard per CPU core, so in a single query, 16 subqueries will go to both nodes evenly and use one of the CPU cores. * In your 8x2 configuration, you still get to use one CPU core per shard, but the shards are bigger, so maybe each subquery takes longer (for the single query thread and 8x2 scenario I would expect CPU utilization to be lower?). * In your 16x2 case, 16 subqueries will be distributed unevenly, and some node will get more than 8 subqueries, which means that some of the subqueries will have to wait for their turn for a CPU core. In addition, more Solr cores will be competing for resources. If this theory is correct, adding more replicas won't speed up your queries; you need to either get faster CPUs or simplify your queries/configuration in some way. Adding more replicas should improve your query throughput, but only if you add them on more HW, not the same one. ...anyway, just a theory Tomás On Fri, Jan 8, 2016 at 7:40 AM, Shawn Heisey wrote: > On 1/8/2016 7:55 AM, Luca Quarello wrote: > > I used solr5.3.1 and I sincerely expected response times with replica > > configuration near to response times without replica configuration. > > > > Do you agree with me? > > > > I read here > > > http://lucene.472066.n3.nabble.com/Solr-Cloud-Query-Scaling-td4110516.html > > that "Queries do not need to be routed to leaders; they can be handled by > > any replica in a shard. Leaders are only needed for handling update > > requests. " > > > > I haven't found this behaviour. In my case CONF2 and CONF3 have all > replicas > > on VM2 but analyzing core utilization during a request it is 100% on both > > machines. Why? 
> > Indexing is a little bit slower with replication -- the update must > happen on all replicas. > > If your index is sharded (which I believe you did indicate in your > initial message), you may find that all replicas get used even for > queries. It is entirely possible that some of the shard subqueries will > be processed on one replica and some of them will be processed on other > replicas. I do not know if this commonly happens, but I would not be > surprised if it does. If the machines are sized appropriately for the > index, this separation should speed up queries, because you have the > resources of multiple machines handling one query. > > That phrase "sized appropriately" is very important. Your initial > message indicated that you have a 90GB index, and that you are running > in virtual machines. Typically VMs have fairly small memory sizes. It > is very possible that you simply don't have enough memory in the VM for > good performance with an index that large. With 90GB of index data on > one machine, I would hope for at least 64GB of RAM, and I would prefer > to have 128GB. If there is more than 90GB of data on one machine, then > even more memory would be needed. > > Thanks, > Shawn > >
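Tomás's core-per-shard argument can be sketched numerically. A toy model: round-robin placement is an assumption (Solr's real replica selection is closer to random, which only makes the 16x2 case worse):

```python
# S sub-queries (one per shard) spread over N nodes with C cores each.
# If any node receives more than C sub-queries, the excess must queue
# for a CPU core.
def max_subqueries_per_node(shards: int, nodes: int) -> int:
    per_node = [0] * nodes
    for s in range(shards):
        per_node[s % nodes] += 1  # best case: perfectly even spread
    return max(per_node)

CORES = 8
# 16 shards, one replica each, over 2 nodes: exactly one core per sub-query
print(max_subqueries_per_node(16, 2))  # 8 -> no queueing

# With 2 replicas per shard there is no fixed placement; in the worst case
# one node fields all 16 sub-queries and half of them wait for a core.
worst_case = 16
print(max(0, worst_case - CORES))  # 8 sub-queries queue
```

This is why 16x1 on two 8-core VMs saturates evenly, while 16x2 can leave some sub-queries waiting even though total capacity is identical.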
SolrCloud: Setting/finding node names for deleting replicas
Hi, I'm having trouble identifying a replica to delete... I've created a 3-shard cluster, all 3 created on a single host, then added a replica for shard2 onto another host, no problem so far. Now I want to delete the original shard, but got this error when trying a *replica* param value I thought would work... shard2/uk available replicas are core_node1,core_node4 I can't find any mention of core_node1 or core_node4 via the admin UI, how would I know/find the name of each one? Is it possible to set these names explicitly myself for easier maintenance? Many thanks for any guidance, Rob
Re: SOLR replicas performance
Hi Shawn, I expect that indexing is a little bit slower with replication, but in my case it is 3 times worse. I can't explain this. The monitored consumption of resources is: All the tests have pointed out an I/O utilization of 100MB/s while loading data into the disk cache, disk cache utilization of 20GB, and core utilization of 100% (all 8 cores), so it seems that the bottleneck is the cores and not RAM. I don't expect a performance improvement from increasing RAM. Am I wrong? Thanks, Luca On Fri, Jan 8, 2016 at 4:40 PM, Shawn Heisey wrote: > On 1/8/2016 7:55 AM, Luca Quarello wrote: > > I used solr5.3.1 and I sincerely expected response times with replica > > configuration near to response times without replica configuration. > > > > Do you agree with me? > > > > I read here > > > http://lucene.472066.n3.nabble.com/Solr-Cloud-Query-Scaling-td4110516.html > > that "Queries do not need to be routed to leaders; they can be handled by > > any replica in a shard. Leaders are only needed for handling update > > requests. " > > > > I haven't found this behaviour. In my case CONF2 and CONF3 have all > replicas > > on VM2 but analyzing core utilization during a request it is 100% on both > > machines. Why? > > Indexing is a little bit slower with replication -- the update must > happen on all replicas. > > If your index is sharded (which I believe you did indicate in your > initial message), you may find that all replicas get used even for > queries. It is entirely possible that some of the shard subqueries will > be processed on one replica and some of them will be processed on other > replicas. I do not know if this commonly happens, but I would not be > surprised if it does. If the machines are sized appropriately for the > index, this separation should speed up queries, because you have the > resources of multiple machines handling one query. > > That phrase "sized appropriately" is very important. 
Your initial > message indicated that you have a 90GB index, and that you are running > in virtual machines. Typically VMs have fairly small memory sizes. It > is very possible that you simply don't have enough memory in the VM for > good performance with an index that large. With 90GB of index data on > one machine, I would hope for at least 64GB of RAM, and I would prefer > to have 128GB. If there is more than 90GB of data on one machine, then > even more memory would be needed. > > Thanks, > Shawn > >
Re: enable disable filter query caching based on statistics
I read the client was happy, so I am only curious to know more :) Apart from readability, shouldn't it be more efficient to put the filters directly in the main query if you don't cache? (Checking the code: when not caching, Solr adds a Lucene boolean query with a score of 0; maybe this is an indication that at the current stage this affirmation is no longer true. In the past it was a better approach than having them in separate filters.) How do you specify a filter to be a postFilter and run only over the query result cache? Of course I don't know if you are excluding filters via tags or have some other requirements. I saw you specified the gain in rpm; what about the query time? The rest of the issue is also covered in the comment in the Solr source code: org/apache/solr/search/SolrIndexSearcher.java:1597 ... // now actually use the filter cache. // for large filters that match few documents, this may be // slower than simply re-executing the query. if (out.docSet == null) { out.docSet = getDocSet(cmd.getQuery(), cmd.getFilter()); DocSet bigFilt = getDocSet(cmd.getFilterList()); if (bigFilt != null) out.docSet = out.docSet.intersection(bigFilt); } ... Cheers Binoy: bq: In such a case won't applying fqs normally be the same as applying them as post filters Certainly not, at least AFAIK... By definition, regular FQs are calculated over the entire corpus (not, NOT just the docs that satisfy the query). Then that entire bitset is stored in the filterCache where it can be reused, which is why filterCache entries can be used for different queries. Also by definition, post filters are _not_ calculated over the entire corpus; they are only calculated for docs that 1> pass the query criteria and 2> pass all lower-cost filters, so they will not apply at all to the next query, are not stored in the filterCache, etc. So I think what Matteo is seeing is that with a restrictive FQ clause, very few docs have to be tested against most of the FQs. 
Matteo: My guess (and I'm not intimately familiar with the code) is that, indeed, the restrictive clause is helping you a lot here. Frankly I doubt that adding a cost will make a measurable difference if the most restrictive FQ clause is quite sparse. I'm still puzzled why, in your test scenario, there is such a difference when making all the filter queries cache=false. _Assuming_ that provincia and type are relatively low-cardinality fields, they should all be in the filterCache pretty quickly. But perhaps anding the bitsets together is more expensive than the advantage in this case. I'd be curious as to the hit ratio you were seeing. But as you say, if the client is satisfied I'm not sure it's worth pursuing... Best, Erick On Tue, Jan 5, 2016 at 11:09 AM, Matteo Grolla wrote: > Hi Erik, > the test was done on thousands of queries of that kind and millions of > docs > I went from <1500 qpm to ~ 6000 qpm on modest virtualized hardware (cpu > bound and cpu was scarce) > After that, customer happy, time finished and I didn't go further, but > definitely cost was something I'd try > When I saw the presentation of CloudSearch where they explained that they > were enabling/disabling caching based on fq statistics I thought this kind > of problem was general enough that I could find a plugin already built > > 2016-01-05 19:17 GMT+01:00 Erick Erickson : > >> >> fq={!cache=false}n_rea:xxx&fq={!cache=false}provincia:,fq={!cache=false}type: >> >> You have a comma in front of the last fq clause, typo? >> >> Well, the whole point of caching filter queries is so that the >> _second_ time you use it, >> very little work has to be done. That comes at a cost of course for >> first-time execution. >> Basically any fq clause that you can guarantee won't be re-used should >> have cache=false >> set. >> >> I'd be surprised if the second time you use the provincia and type fq >> clauses not caching >> would be faster, but I've been surprised before. 
I guess anding two >> bitsets together could >> take more time than, say, testing a small number of individual >> documents >> >> And I'm assuming that you're testing multiple queries rather than just >> one-offs. >> >> If you _do_ know that some of your clauses are very restrictive, I >> wonder what happens if >> you add a cost in. fq's are evaluated in cost order (when >> cache=false), so what happens >> in this case? >> fq={!cache=false cost=101}n_rea:xxx&fq={!cache=false >> cost=102}provincia:&fq={!cache=false cost=103}type: >> >> Best, >> Erick >> >> On Tue, Jan 5, 2016 at 9:41 AM, Matteo Grolla >> wrote: >> > Thanks Erik and Binoy, >> > This is a case I stumbled upon: with queries like >> > >> > >> q=*:*&fq={!cache=false}n_rea:xxx&fq={!cache=false}provincia:,fq={!cache=false}type: >> > >> > where the n_rea filter is highly selective >> > I was able to make a > 3x performance improvement disabling cache >> > >> > I think it's because the
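The cost-ordered request Erick describes can be sketched as assembled parameters. A sketch: the provincia/type values are stand-ins for the ones elided in the archive, and note that only filters whose query type implements PostFilter (frange, collapse, etc.) actually run as post-filters when cost >= 100:

```python
from urllib.parse import urlencode

# cache=false on every fq, with ascending cost so the most selective
# filter (n_rea) is evaluated first and later filters only see survivors.
params = [
    ("q", "*:*"),
    ("fq", "{!cache=false cost=101}n_rea:xxx"),
    ("fq", "{!cache=false cost=102}provincia:TO"),  # value is a stand-in
    ("fq", "{!cache=false cost=103}type:A"),        # value is a stand-in
]
encoded = urlencode(params)
print(encoded)
```

The encoded string goes on the select URL; repeating the "fq" key is how multiple filter queries are sent in one request.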
Re: SOLR replicas performance
Hi Tomas, here are some more details. - The fragment field contains 3KB xml messages. - The queries that I used for the test are (I only change the word to search inside the fragment field between requests): curl "http://localhost:8983/solr/sepa/select?q=+fragment%3A*A*+&fq=marked%3AT&fq=-fragmentContentType%3ABULK&start=0&rows=100&sort=creationTimestamp+desc%2Cid+asc" - All the tests were executed inside VMs on dedicated HW; in detail: 2 Hypervisor ESX 5.5 on: - Server PowerEdge T420 - Dual Xeon E5-2420 with 128GB of RAM - RAID10 local storage, 4x Near Line SAS 7.200 (about 100MB/s guaranteed bandwidth) I have executed another test with the configuration: 8 shards of 35M documents on VM1 and 8 empty shards on VM2 (CONF4). The configuration is without replicas. We can now compare the response times (in seconds) for CONF2 and CONF4: - without indexing operations - CONF2 - sequential: 12,3 17,4 - 5 parallel: 32,5 34,2 - 10 parallel: 45,4 49 - 20 parallel: 64,6 74 - CONF4 - sequential: 5 9,1 - 5 parallel: 25 31 - 10 parallel: 41 49 - 20 parallel: 60 73 - with indexing operations - CONF2 - sequential: 12,3 19 - 5 parallel: 39 40,8 - 10 parallel: 56,6 62,9 - 20 parallel: 79 116 - CONF4 - sequential: 15,5 17,5 - 5 parallel: 30,7 38,3 - 10 parallel: 57,5 64,2 - 20 parallel: 60 81,4 During the test: - CONF2: 8 cores on VM1 and 8 cores on VM2 100% used (except for the sequential test without indexing operations, where the usage was about 80%). - CONF4: 8 cores on VM1 100% used As you can see, performance is similar for tests with 5 and 10 parallel requests, both during indexing operations and without indexing operations, but very different with sequential requests and with 20 parallel requests. I don't understand why. Thanks, Luca On Fri, Jan 8, 2016 at 6:47 PM, Tomás Fernández Löbbe wrote: > Hi Luca, > It looks like your queries are complex wildcard queries. My theory is that > you are CPU-bounded, for a single query one CPU core for each shard will be > at 100% for the duration of the sub-query. 
Smaller shards make these > sub-queries faster which is why 16 shards is better than 8 in your case. > * In your 16x1 configuration, you have exactly one shard per CPU core, so > in a single query, 16 subqueries will go to both nodes evenly and use one > of the CPU cores. > * In your 8x2 configuration, you still get to use one CPU core per shard, > but the shards are bigger, so maybe each subquery takes longer (for the > single query thread and 8x2 scenario I would expect CPU utilization to be > lower?). > * In your 16x2 case 16 subqueries will be distributed un-evenly, and some > node will get more than 8 subqueries, which means that some of the > subqueries will have to wait for their turn for a CPU core. In addition, > more Solr cores will be competing for resources. > If this theory is correct, adding more replicas won't speedup your queries, > you need to either get faster CPU or simplify your queries/configuration in > some way. Adding more replicas should improve your query throughput, but > only if you add them in more HW, not the same one. > > ...anyway, just a theory > > Tomás > > On Fri, Jan 8, 2016 at 7:40 AM, Shawn Heisey wrote: > > > On 1/8/2016 7:55 AM, Luca Quarello wrote: > > > I used solr5.3.1 and I sincerely expected response times with replica > > > configuration near to response times without replica configuration. > > > > > > Do you agree with me? > > > > > > I read here > > > > > > http://lucene.472066.n3.nabble.com/Solr-Cloud-Query-Scaling-td4110516.html > > > that "Queries do not need to be routed to leaders; they can be handled > by > > > any replica in a shard. Leaders are only needed for handling update > > > requests. " > > > > > > I haven't found this behaviour. In my case CONF2 e CONF3 have all > > replicas > > > on VM2 but analyzing core utilization during a request is 100% on both > > > machines. Why? > > > > Indexing is a little bit slower with replication -- the update must > > happen on all replicas. 
> > > > If your index is sharded (which I believe you did indicate in your > > initial message), you may find that all replicas get used even for > > queries. It is entirely possible that some of the shard subqueries will > > be processed on one replica and some of them will be processed on other > > replicas. I do not know if this commonly happens, but I would not be > > surprised if it does. If the machines are sized appropriately for the > > index, this separation should speed up queries, because you have the > > resources of multiple machines handling one query. > > > > That phrase "sized appropriately" is very important. Your initial > > message indicated that you have
Re: SolrCloud: Setting/finding node names for deleting replicas
Honestly, I have no idea which is "old". The solr source itself uses slice pretty consistently, so I stuck with that when I started the project last year. And logically, a shard being an instance of a slice makes sense to me. But one significant place where the word shard is exposed is the default names of the slices, so it’s a mixed bag. See here: https://github.com/whitepages/solrcloud_manager#terminology On 1/8/16, 2:34 PM, "Robert Brown" wrote: >Thanks for the pointer Jeff, > >For SolrCloud it turned out to be... > >=xxx > >btw, for your app, isn't "slice" old notation? > > > > >On 08/01/16 22:05, Jeff Wartes wrote: >> >> I’m pretty sure you could change the name when you ADDREPLICA using a >> core.name property. I don’t know if you can when you initially create the >> collection though. >> >> The CLUSTERSTATUS command will tell you the core names: >> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api18 >> >> That said, this tool might make things easier. >> https://github.com/whitepages/solrcloud_manager >> >> >> # shows cluster status, including core names: >> java -jar solrcloud_manager-assembly-1.4.0.jar -z zk0.example.com:2181/myapp >> >> >> # deletes a replica by node/collection/shard (figures out the core name >> under the hood) >> java -jar solrcloud_manager-assembly-1.4.0.jar deletereplica -z >> zk0.example.com:2181/myapp -c collection1 --node node1.example.com --slice >> shard2 >> >> >> I mention this tool every now and then on this list because I like it, but >> I’m the author, so take that with a pretty big grain of salt. Feedback is >> very welcome. >> >> >> >> >> >> >> >> On 1/8/16, 1:18 PM, "Robert Brown" wrote: >> >>> Hi, >>> >>> I'm having trouble identifying a replica to delete... >>> >>> I've created a 3-shard cluster, all 3 created on a single host, then >>> added a replica for shard2 onto another host, no problem so far. 
>>> >>> Now I want to delete the original shard, but got this error when trying >>> a *replica* param value I thought would work... >>> >>> shard2/uk available replicas are core_node1,core_node4 >>> >>> I can't find any mention of core_node1 or core_node4 via the admin UI, >>> how would I know/find the name of each one? >>> >>> Is it possible to set these names explicitly myself for easier maintenance? >>> >>> Many thanks for any guidance, >>> Rob >>> >
Re: Performance of stats=true&stats.field={!cardinality=1.0}fl
On Wed, 2016-01-06 at 12:39 +0530, Modassar Ather wrote: > q=fl1:net*&fl=fl&rows=50&stats=true&stats.field={!cardinality=1.0}fl > is returning cardinality around 15 million. It is taking around 4 minutes. Is this a single shard or multiple? Anyway, you might have better luck trying the 'unique' request in JSON faceting: https://cwiki.apache.org/confluence/display/solr/Faceted+Search - Toke Eskildsen, State and University Library, Denmark
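The 'unique' request Toke mentions is expressed as a JSON Facet API body, roughly like this sketch (field names taken from the thread; the hll aggregation is an approximate alternative that may be cheaper at ~15M distinct values, worth testing side by side):

```python
import json

# JSON Facet API request body: unique() gives an exact distinct count,
# hll() a HyperLogLog approximation. limit/query mirror the original
# request shape from the thread.
body = {
    "query": "fl1:net*",
    "limit": 0,
    "facet": {
        "distinct_exact": "unique(fl)",
        "distinct_approx": "hll(fl)",
    },
}
print(json.dumps(body))
```

The body is POSTed to the select handler (or passed as the json.facet parameter); on a 12-shard cluster the per-shard results are merged, which is where the approximate hll variant tends to pay off.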
Re: SolrCloud: Setting/finding node names for deleting replicas
I’m pretty sure you could change the name when you ADDREPLICA using a core.name property. I don’t know if you can when you initially create the collection though. The CLUSTERSTATUS command will tell you the core names: https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api18 That said, this tool might make things easier. https://github.com/whitepages/solrcloud_manager # shows cluster status, including core names: java -jar solrcloud_manager-assembly-1.4.0.jar -z zk0.example.com:2181/myapp # deletes a replica by node/collection/shard (figures out the core name under the hood) java -jar solrcloud_manager-assembly-1.4.0.jar deletereplica -z zk0.example.com:2181/myapp -c collection1 --node node1.example.com --slice shard2 I mention this tool every now and then on this list because I like it, but I’m the author, so take that with a pretty big grain of salt. Feedback is very welcome. On 1/8/16, 1:18 PM, "Robert Brown"wrote: >Hi, > >I'm having trouble identifying a replica to delete... > >I've created a 3-shard cluster, all 3 created on a single host, then >added a replica for shard2 onto another host, no problem so far. > >Now I want to delete the original shard, but got this error when trying >a *replica* param value I thought would work... > >shard2/uk available replicas are core_node1,core_node4 > >I can't find any mention of core_node1 or core_node4 via the admin UI, >how would I know/find the name of each one? > >Is it possible to set these names explicitly myself for easier maintenance? > >Many thanks for any guidance, >Rob >
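For comparison, the plain Collections API calls underneath are roughly as follows (host and collection names are examples): CLUSTERSTATUS exposes the core_nodeN names, and DELETEREPLICA then accepts one of them as its replica parameter.

```python
from urllib.parse import urlencode

base = "http://node1.example.com:8983/solr/admin/collections"

# Step 1: list shards and their replica names (core_node1, core_node4, ...)
status_url = base + "?" + urlencode(
    {"action": "CLUSTERSTATUS", "collection": "collection1"}
)

# Step 2: delete one replica by its core_nodeN name
delete_url = base + "?" + urlencode({
    "action": "DELETEREPLICA",
    "collection": "collection1",
    "shard": "shard2",
    "replica": "core_node4",
})
print(status_url)
print(delete_url)
```

This is the same two-step flow the tool automates: resolve the core name, then delete it.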
Re: SolrCloud: Setting/finding node names for deleting replicas
Thanks for the pointer Jeff, For SolrCloud it turned out to be... =xxx btw, for your app, isn't "slice" old notation? On 08/01/16 22:05, Jeff Wartes wrote: I’m pretty sure you could change the name when you ADDREPLICA using a core.name property. I don’t know if you can when you initially create the collection though. The CLUSTERSTATUS command will tell you the core names: https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api18 That said, this tool might make things easier. https://github.com/whitepages/solrcloud_manager # shows cluster status, including core names: java -jar solrcloud_manager-assembly-1.4.0.jar -z zk0.example.com:2181/myapp # deletes a replica by node/collection/shard (figures out the core name under the hood) java -jar solrcloud_manager-assembly-1.4.0.jar deletereplica -z zk0.example.com:2181/myapp -c collection1 --node node1.example.com --slice shard2 I mention this tool every now and then on this list because I like it, but I’m the author, so take that with a pretty big grain of salt. Feedback is very welcome. On 1/8/16, 1:18 PM, "Robert Brown"wrote: Hi, I'm having trouble identifying a replica to delete... I've created a 3-shard cluster, all 3 created on a single host, then added a replica for shard2 onto another host, no problem so far. Now I want to delete the original shard, but got this error when trying a *replica* param value I thought would work... shard2/uk available replicas are core_node1,core_node4 I can't find any mention of core_node1 or core_node4 via the admin UI, how would I know/find the name of each one? Is it possible to set these names explicitly myself for easier maintenance? Many thanks for any guidance, Rob
Re: solrcloud -How to delete a doc at a specific shard
The solr version is 5.2.0. The problem is that different shards contain the same ID, and the document router is compositeId; if I do ../collection/update?commit=true=idhere, then this id goes missing in the whole solrcloud. -- View this message in context: http://lucene.472066.n3.nabble.com/solrcloud-How-to-delete-a-doc-at-a-specific-shard-tp4249354p4249601.html
Specifying a different txn log directory
Hi, How do I specify a different directory for transaction logs? I tried using the updateLog entry in solrconfig.xml and reloaded the collection, but that does not seem to work. Is there another setting I need to change? Thanks Nitin
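For reference, the updateLog entry usually takes this shape (a sketch; the path and the solr.ulog.dir property fallback are illustrative, and whether a reload alone picks the change up, as opposed to a full node restart, is exactly the open question here):

```xml
<!-- solrconfig.xml: dir can be a literal path or fed from the
     solr.ulog.dir system property (-Dsolr.ulog.dir=... at startup) -->
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:/var/solr/tlog}</str>
  </updateLog>
</updateHandler>
```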
Re: Performance of stats=true&stats.field={!cardinality=1.0}fl
Hi, Any input will be helpful. Thanks, Modassar On Wed, Jan 6, 2016 at 12:39 PM, Modassar Ather wrote: > Hi, > > q=fl1:net*&fl=fl&rows=50&stats=true&stats.field={!cardinality=1.0}fl > is returning cardinality around 15 million. It is taking around 4 minutes. > Similar response time is seen with different queries which yield high > cardinality. Kindly note that cardinality=1.0 is the desired goal. > Here in the above example fl1 is a text field whereas fl is a docValues-enabled, > non-stored, non-indexed field. > Kindly let me know if such response time is expected or if I am missing > something about this feature in my query. > > Thanks, > Modassar >
date difference faceting
Hi, I have two date fields, d_a and d_b, both of type solr.TrieDateField, that represent different events associated with a particular document. The interval between these dates is relevant for corner-case statistics. The interval is calculated as the difference sub(d_b,d_a), and I've been able to get stats on it with stats=true&stats.field={!func}sub(d_b,d_a) What I ultimately would like to report is the interval represented as a range, which could be seen as facet.query (pseudo code) facet.query=sub(d_b,d_a)[ * TO 8640 ] // day facet.query=sub(d_b,d_a)[ 8641 TO 60480 ] // week facet.query=sub(d_b,d_a)[ 60481 TO 259200 ] // month etc. Aside from actually indexing the difference in a separate field, is there something obvious I'm missing? I'm on SOLR 5.2 in cloud mode. thanks David
Re: Solr search and index rate optimization
On Fri, 2016-01-08 at 10:55 +0500, Zap Org wrote: > I wanted to ask: I need to index every 15 min with a hard commit > (real-time records) and currently have 5 zookeeper instances and 2 solr > instances on one machine serving 200 users with 32GB RAM, whereas I want > to serve more than 10,000 users. What should my machine specs and > architecture be for that serve rate along with the index rate? It depends on your system, and if we were forced to guess, our guess would be very loose. Fortunately you do have a running system with real queries: Make a copy on two similar machines (you will probably need more hardware anyway) and simulate growing traffic, measuring response times at appropriate points: 200 users, 500, 1000, 2000 etc. If you are very lucky, your current system scales all the way. If not, you should have enough data to make an educated guess of the number of machines you need. You should have at least 3 measuring points to extrapolate from, as scaling is not always linear. - Toke Eskildsen, State and University Library, Denmark
Re: Manage schema.xml via Solrj?
On 1/8/2016 6:30 AM, Bob Lawson wrote: > Thanks for the replies. The problem I'm trying to solve is to automate > whatever steps I can in configuring Solr for our customer. Rather than an > admin have to edit schema.xml, I thought it would be easier and less > error-prone to do it programmatically. But I'm a novice, so if there is a > better, more standard way, please let me know. Thanks!!! I personally find editing the schema.xml to be the best option, but I have not actually used the Schema API. At the point in my deployment where I was making frequent schema edits (mostly on 1.4 versions, with some of it on 3.x versions), the API did not exist. The information about this API in the reference guide looks pretty nice. > PS: What do you mean by "XY problem"? This is summarized here: https://home.apache.org/~hossman/#xyproblem Thanks, Shawn
Re: Performance of stats=true&stats.field={!cardinality=1.0}fl
Hi Toke, > Is this a single shard or multiple? It is a 12-shard cluster without replicas and has around 90+ GB on each shard. Thanks for sharing the link. I will look into that. Regards, Modassar On Fri, Jan 8, 2016 at 4:28 PM, Toke Eskildsen wrote: > On Wed, 2016-01-06 at 12:39 +0530, Modassar Ather wrote: > > q=fl1:net*&fl=fl&rows=50&stats=true&stats.field={!cardinality=1.0}fl > > is returning cardinality around 15 million. It is taking around 4 > minutes. > > Is this a single shard or multiple? > > Anyway, you might have better luck trying the 'unique' request in JSON > faceting: > https://cwiki.apache.org/confluence/display/solr/Faceted+Search > > - Toke Eskildsen, State and University Library, Denmark > > >
Re: Solr UIMA Custom Annotator PEAR file installation on Linux
Hi, do you mean you want to use a PEAR to provide the Annotator for the Solr UIMA UpdateProcessor? Can you please detail your needs a bit more? Regards, Tommaso 2016-01-08 1:57 GMT+01:00 techqnq: > I implemented a custom annotator and generated the PEAR file. > Windows has the PEAR installer utility, but how do I do this from the command > line, and what other options are there on Linux? > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Solr-UIMA-Custom-Annotator-PEAR-file-installation-on-Linux-tp4249302.html >
Re: Manage schema.xml via Solrj?
Thanks for the replies. The problem I'm trying to solve is to automate whatever steps I can in configuring Solr for our customer. Rather than an admin have to edit schema.xml, I thought it would be easier and less error-prone to do it programmatically. But I'm a novice, so if there is a better, more standard way, please let me know. Thanks!!! PS: What do you mean by "XY problem"? On Thu, Jan 7, 2016 at 11:20 PM, Erick Ericksonwrote: > I'd ask first what the high-level problem you're trying to solve is, this > could be an XY problem. > > That said, there's the Schema API you can use, see: > https://cwiki.apache.org/confluence/display/solr/Schema+API > > You can access it from the SolrJ library, see > SchemaRequest.java. For examples of using this, see: > SchemaTest.java > > to _get_ the Solr source code to see these, see: > https://wiki.apache.org/solr/HowToContribute > > Best, > Erick > > On Thu, Jan 7, 2016 at 7:01 PM, Binoy Dalal > wrote: > > I am not sure about solrj but you can use any XML parsing library to > > achieve this. > > Take a look here: > > http://www.tutorialspoint.com/java_xml/java_xml_parsers.htm > > > > On Fri, 8 Jan 2016, 08:06 Bob Lawson wrote: > > > >> I want to programmatically make changes to schema.xml using java to do > >> it. Should I use Solrj to do this or is there a better way? Can I use > >> Solrj to make the rest calls that make up the schema API? Whatever the > >> answer, can anyone point me to an example showing how to do it? Thanks! > >> > >> -- > > Regards, > > Binoy Dalal >
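The Schema API mentioned in the thread takes JSON bodies POSTed to /solr/<collection>/schema instead of hand edits to schema.xml, which fits the automation goal here. A minimal sketch of such a body (the field name and type below are illustrative, not from the thread):

```python
import json

# An "add-field" command for the Schema API. Multiple commands
# (add-field, replace-field, add-field-type, ...) can be combined
# in one request body.
payload = {
    "add-field": {
        "name": "product_name",      # illustrative field name
        "type": "text_general",
        "stored": True,
        "indexed": True,
    }
}
body = json.dumps(payload)
print(body)
```

SolrJ's SchemaRequest classes build and send exactly this kind of body, so an admin script can add fields without anyone touching schema.xml.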
Re: SOLR replicas performance
Hi Erick, I used solr5.3.1 and I sincerely expected response times with replica configuration near to response times without replica configuration. Do you agree with me? I read here http://lucene.472066.n3.nabble.com/Solr-Cloud-Query-Scaling-td4110516.html that "Queries do not need to be routed to leaders; they can be handled by any replica in a shard. Leaders are only needed for handling update requests. " I haven't found this behaviour. In my case CONF2 and CONF3 have all replicas on VM2 but analyzing core utilization during a request is 100% on both machines. Why? Best, Luca On Tue, Jan 5, 2016 at 5:08 PM, Erick Erickson wrote: > What version of Solr? Prior to 5.2 the replicas were doing lots of > unnecessary work/being blocked, see: > > https://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/ > > Best, > Erick > > On Tue, Jan 5, 2016 at 6:09 AM, Matteo Grolla > wrote: > > Hi Luca, > > not sure if I understood well. Your question is > > "Why are index times on a solr cloud collection with 2 replicas higher > than > > on solr cloud with 1 replica" right? > > Well with 2 replicas all docs have to be separately indexed in 2 places > and > > solr has to confirm that both indexing went well. 
> > Indexing times are lower on a solrcloud collection with 2 shards (just > one > > replica, the leader, per shard) because docs are indexed just once and > the > > load is spread on 2 servers instead of one > > > > 2015-12-30 2:03 GMT+01:00 Luca Quarello : > > > >> Hi, > >> > >> I have an 260M documents index (90GB) with this structure: > >> > >> > >> >> multiValued="false" termVectors="false" termPositions="false" > >> termOffsets="false" /> > >> > >>>> multiValued="false"/> > >> > >>>> stored="true" multiValued="false"/> > >> > >>>> multiValued="false"/> > >> > >>stored="true" > >> multiValued="false"/> > >> > >>>> multiValued="false"/> > >> > >>>> multiValued="false"/> > >> > >>>> multiValued="false"/> > >> > >> > >> > >>>> multiValued="true"/> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> where the fragmetnt field contains XML messagges. > >> > >> There is a search function that provide the messagges satisfying a > search > >> criterion. > >> > >> > >> TARGET: > >> > >> To find the best configuration to optimize the response time of a two > solr > >> instances cloud with 2 VM with 8 core and 32 GB > >> > >> > >> TEST RESULTS: > >> > >> > >>1. > >> > >>Configurations: > >>1. > >> > >> the better configuration without replicas > >> - CONF1: 16 shards of 17M documents (8 per VM) > >> 1. > >> > >> configuration with replica > >> - CONF 2: 8 shards of 35M documents with replication factor of 1 > >> - CONF 3: 16 shards of 35M documents with replication factor > of 1 > >> > >> > >> > >>1. > >> > >>Executed tests > >> > >> > >>- sequential requests > >> - 5 parallel requests > >> - 10 parallel requests > >> - 20 parallel requests > >> > >> in two scenarios: during an indexing phase and not > >> > >> > >> Call are: http://localhost:8983/solr/sepa/select? > >> q=+fragment%3A*AAA*+=marked%3AT=-fragmentContentType > >> %3ABULK=0=100=creationTimestamp+desc%2Cid+asc > >> > >> > >>1. 
> >> > >>Test results > >> > >>All the test have point out an I/O utilization of 100MB/s > during > >> > >> loading data on disk cache, disk cache utilization of 20GB and core > >> utilization of 100% (all 8 cores) > >> > >> > >> > >>- > >> > >>No indexing > >>- > >> > >> CONF1 (time average and maximum time) > >> - > >> > >> sequential: 4,1 6,9 > >> - > >> > >> 5 parallel: 15,6 19,1 > >> - > >> > >> 10 parallel: 23,6 30,2 > >> - > >> > >> 20 parallel: 48 52,2 > >> - > >> > >> CONF2 > >> - > >> > >> sequential: 12,3 17,4 > >> - > >> > >> 5 parallel: 32,5 34,2 > >> - > >> > >> 10 parallel: 45,4 49 > >> - > >> > >> 20 parallel: 64,6 74 > >> - > >> > >> CONF3 > >> - > >> > >> sequential: 6,9 9,9 > >> - > >> > >> 5 parallel: 33,2 37,5 > >> - > >> > >> 10 parallel: 46 51 > >> - > >> > >> 20 parallel: 68 83 > >> > >> > >> > >>- > >> > >>Indexing (into the solr admin console is it possible to view the > >> total throughput? > >>I find it only relative to a single shard). > >> > >> > >> CONF1 > >> > >>- > >> > >> sequential: 7,7 9,5 > >> - > >> > >> 5 parallel: 26,8 28,4 > >> - > >> > >> 10 parallel: 31,8 37,8 > >> - > >> > >> 20 parallel: 42 52,5 > >> - > >> > >>CONF2 > >>- > >> > >> sequential: 12,3 19 > >> - > >> >
RE: Manage schema.xml via Solrj?
Bob, XY problem means that you are presenting the imagined solution without presenting the problem to solve. In other words, you are presenting X (solve for X), without a full statement of the equation to be solved for X. My guess at your problem is the same as my problem - editing Solr configuration (schema and solrconfig.xml) as files is very flexible and Agile compared to a form based solution, but that comes with the downside that anyone can "crash" a Solr collection by editing the schema wrong. This goes beyond just XML syntax checking, obviously. But only Solr is the authority on what a good schema (and other configuration) should look like. I'm working on a tool that can provide a bit of "smoke testing" on a Solr configuration directory. The workflow I envision is like this: 1. DEVELOPER, TEAM LEAD, or SOLR ADMIN MAKE CHANGES TO CONFIGURATION DIRECTORY In the beginning, they may need to make lots of changes. Eventually, they are only making small changes, but we don't want those small changes to crash anything. 2. DEVELOPER, TEAM LEAD, or SOLR ADMIN TRIGGER CONTINUOUS INTEGRATION When they push or merge to a git branch, that may trigger a CI workflow. The workflow works like this: 2a. Run the "smoke test" tool to (a) create a temporary configset in Zookeeper, (b) create a temporary collection in SolrCloud, and (c) do simple indexing. 2b. Use zkCli.sh and solr.sh to update the actual configset and collection in SolrCloud. 3. ITERATE This can happen again and again with a "staging", "QA", "Production" set of branches. Other checks can be put into the CI workflow as well. So, along the way to having this vision (of my solution), I also considered the advantage of schemaless systems. I don't want to throw stones, but I think schemaless is mostly a marketing term for a couple of reasons: - I do Linked Data/RDF - it is different from SQL, but not schemaless. If your "vocabulary" is badly designed, then your users will have problems. 
- ElasticSearch is not really schemaless. Any ElasticSearch conference is filled with tracks/sessions on how to get your "field mappings" right, and what happens if you don't (too big indexes, need to re-index to fix stuff, etc.) - IBM Watson Explorer is not really schemaless - your update document has to specify the type and treatment of each field, or your XSLT must transform your document into a structure that does so. Many of us have also seen what happens with denormalized SQL or fully normalized SQL. "Schemafull" ought to be a marketing term as well. -Original Message- From: Bob Lawson [mailto:bwlawson...@gmail.com] Sent: Friday, January 08, 2016 8:30 AM To: solr-user@lucene.apache.org Subject: Re: Manage schema.xml via Solrj? Thanks for the replies. The problem I'm trying to solve is to automate whatever steps I can in configuring Solr for our customer. Rather than an admin have to edit schema.xml, I thought it would be easier and less error-prone to do it programmatically. But I'm a novice, so if there is a better, more standard way, please let me know. Thanks!!! PS: What do you mean by "XY problem"? On Thu, Jan 7, 2016 at 11:20 PM, Erick Erickson wrote: > I'd ask first what the high-level problem you're trying to solve is, > this could be an XY problem. > > That said, there's the Schema API you can use, see: > https://cwiki.apache.org/confluence/display/solr/Schema+API > > You can access it from the SolrJ library, see SchemaRequest.java. For > examples of using this, see: > SchemaTest.java > > to _get_ the Solr source code to see these, see: > https://wiki.apache.org/solr/HowToContribute > > Best, > Erick > > On Thu, Jan 7, 2016 at 7:01 PM, Binoy Dalal > wrote: > > I am not sure about solrj but you can use any XML parsing library to > > achieve this. 
> > Take a look here: > > http://www.tutorialspoint.com/java_xml/java_xml_parsers.htm > > > > On Fri, 8 Jan 2016, 08:06 Bob Lawson wrote: > > > >> I want to programmatically make changes to schema.xml using java to > >> do it. Should I use Solrj to do this or is there a better way? > >> Can I use Solrj to make the rest calls that make up the schema API? > >> Whatever the answer, can anyone point me to an example showing how to do > >> it? Thanks! > >> > >> -- > > Regards, > > Binoy Dalal >
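Steps 2a/2b of the workflow described above can be sketched with the scripts that ship with Solr 5.x; the ZooKeeper host, configset names, and collection names below are hypothetical, and the exact flags are worth checking against your Solr version:

```shell
# 2a: upload the configuration directory under test as a temporary configset
server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 \
  -cmd upconfig -confdir ./solr-conf -confname smoketest

# ... create a throwaway collection from it and try simple indexing
bin/solr create -c smoketest -n smoketest
bin/post -c smoketest example/exampledocs/books.json

# 2b: if that worked, push the same directory over the real configset
# and reload the production collection
server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 \
  -cmd upconfig -confdir ./solr-conf -confname myconf
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection"

# clean up the temporary collection
curl "http://localhost:8983/solr/admin/collections?action=DELETE&name=smoketest"
```

This requires a running SolrCloud cluster, so it is a CI sketch rather than a standalone script.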
Re: SOLR replicas performance
Hi Matteo, there are two questions: - "Why are response times on a solr cloud collection with 1 replica higher than on solr cloud without replica" Configuration1: solrCloud with two 8 cores VMs each with 8 shards of 17M docs Configuration2: solrCloud with two 8 cores VMs each with 8 shards of 17M docs (8 master and 8 replicas) I measured worse response times for the replicas configuration (conf2) when: - Scenario1: I do queries without inserting records into the index - Scenario2: I do queries while inserting records into the index I expect similar response times in Scenario1 and better response times for configuration2 in Scenario2. Is that correct? Thanks, Luca On Fri, Jan 8, 2016 at 3:56 PM, Luca Quarello wrote: > Hi Erick, > I used solr5.3.1 and I sincerely expected response times with replica > configuration near to response times without replica configuration. > > Do you agree with me? > > I read here > http://lucene.472066.n3.nabble.com/Solr-Cloud-Query-Scaling-td4110516.html > that > "Queries do not need to be routed to leaders; they can be handled by any > replica in a shard. Leaders are only needed for handling update requests. > " > > I haven't found this behaviour. In my case CONF2 and CONF3 have all replicas > on VM2 but analyzing core utilization during a request is 100% on both > machines. Why? > > Best, > Luca > > On Tue, Jan 5, 2016 at 5:08 PM, Erick Erickson > wrote: > >> What version of Solr? Prior to 5.2 the replicas were doing lots of >> unnecessary work/being blocked, see: >> >> https://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/ >> >> Best, >> Erick >> >> On Tue, Jan 5, 2016 at 6:09 AM, Matteo Grolla >> wrote: >> > Hi Luca, >> > not sure if I understood well. Your question is >> > "Why are index times on a solr cloud collection with 2 replicas higher >> than >> > on solr cloud with 1 replica" right? 
>> > Well with 2 replicas all docs have to be deparately indexed in 2 places >> and >> > solr has to confirm that both indexing went well. >> > Indexing times are lower on a solrcloud collection with 2 shards (just >> one >> > replica, the leader, per shard) because docs are indexed just once and >> the >> > load is spread on 2 servers instead of one >> > >> > 2015-12-30 2:03 GMT+01:00 Luca Quarello : >> > >> >> Hi, >> >> >> >> I have an 260M documents index (90GB) with this structure: >> >> >> >> >> >> > >> multiValued="false" termVectors="false" termPositions="false" >> >> termOffsets="false" /> >> >> >> >> > >> multiValued="false"/> >> >> >> >> > >> stored="true" multiValued="false"/> >> >> >> >> > >> multiValued="false"/> >> >> >> >> > stored="true" >> >> multiValued="false"/> >> >> >> >> > >> multiValued="false"/> >> >> >> >> > >> multiValued="false"/> >> >> >> >> > >> multiValued="false"/> >> >> >> >> >> >> >> >> > >> multiValued="true"/> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> where the fragmetnt field contains XML messagges. >> >> >> >> There is a search function that provide the messagges satisfying a >> search >> >> criterion. >> >> >> >> >> >> TARGET: >> >> >> >> To find the best configuration to optimize the response time of a two >> solr >> >> instances cloud with 2 VM with 8 core and 32 GB >> >> >> >> >> >> TEST RESULTS: >> >> >> >> >> >>1. >> >> >> >>Configurations: >> >>1. >> >> >> >> the better configuration without replicas >> >> - CONF1: 16 shards of 17M documents (8 per VM) >> >> 1. >> >> >> >> configuration with replica >> >> - CONF 2: 8 shards of 35M documents with replication factor of 1 >> >> - CONF 3: 16 shards of 35M documents with replication factor >> of 1 >> >> >> >> >> >> >> >>1. 
>> >> >> >>Executed tests >> >> >> >> >> >>- sequential requests >> >> - 5 parallel requests >> >> - 10 parallel requests >> >> - 20 parallel requests >> >> >> >> in two scenarios: during an indexing phase and not >> >> >> >> >> >> Call are: http://localhost:8983/solr/sepa/select? >> >> q=+fragment%3A*AAA*+=marked%3AT=-fragmentContentType >> >> %3ABULK=0=100=creationTimestamp+desc%2Cid+asc >> >> >> >> >> >>1. >> >> >> >>Test results >> >> >> >>All the test have point out an I/O utilization of 100MB/s >> during >> >> >> >> loading data on disk cache, disk cache utilization of 20GB and core >> >> utilization of 100% (all 8 cores) >> >> >> >> >> >> >> >>- >> >> >> >>No indexing >> >>- >> >> >> >> CONF1 (time average and maximum time) >> >> - >> >> >> >> sequential: 4,1 6,9 >> >> - >> >> >> >> 5 parallel: 15,6 19,1 >> >> - >> >> >> >> 10 parallel: 23,6 30,2 >> >> - >> >> >> >> 20 parallel: 48 52,2 >> >> - >> >> >> >> CONF2
Re: Solr UIMA Custom Annotator PEAR file installation on Linux
Yes, I want to use a PEAR file to provide my custom annotator for the Solr UIMA UpdateProcessor. Basically I have written a custom annotator to capture a certain type of data from "content" and copy it over to another Solr field. I generated the PEAR file using the Eclipse UIMA plugins. All good so far. Now I want to use this PEAR file on my Solr server to provide this annotator for the Solr UIMA UpdateProcessor. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-UIMA-Custom-Annotator-PEAR-file-installation-on-Linux-tp4249302p4249496.html Sent from the Solr - User mailing list archive at Nabble.com.
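For a headless install on Linux, UIMA's documented PackageInstaller API can be driven from a small main class instead of the Windows GUI utility. This is a sketch, assuming uima-core.jar is on the classpath; the install directory, PEAR path, and class name are hypothetical:

```java
import java.io.File;
import org.apache.uima.pear.tools.PackageBrowser;
import org.apache.uima.pear.tools.PackageInstaller;

public class InstallPearCli {
    public static void main(String[] args) throws Exception {
        File installDir = new File("/opt/solr/pears");      // hypothetical
        File pearFile = new File("/tmp/MyAnnotator.pear");  // hypothetical
        boolean verify = true; // run the PEAR's verification step after install

        // Unpacks the PEAR and resolves its descriptor paths
        PackageBrowser installed =
                PackageInstaller.installPackage(installDir, pearFile, verify);

        // Path of the installed component descriptor, which the UIMA
        // update processor chain in solrconfig.xml can then reference.
        System.out.println(installed.getComponentPearDescPath());
    }
}
```

This does not need a GUI, so it can run on a server or inside a deployment script.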
Re: Manage schema.xml via Solrj?
Bob, Not sure why you would want to do this. You can set up Solr to guess the schema. It creates a file called manage_schema.xml for an override. This is the case with 5.3. I came across it by accident setting it up the first time and I was a little annoyed, but it made for a quick setup. Your programming would still need to realise the new doc structure and use that new document structure. The only problem is it's a bit generic in the guesswork and I did not spend much time testing it out so I am not really versed in operating it. I got myself back to schema.xml ASAP. My thoughts are you are looking at a lot of work for little gain. Best, GW On 7 January 2016 at 21:36, Bob Lawson wrote: > I want to programmatically make changes to schema.xml using java to do > it. Should I use Solrj to do this or is there a better way? Can I use > Solrj to make the rest calls that make up the schema API? Whatever the > answer, can anyone point me to an example showing how to do it? Thanks! > >
Re: Manage schema.xml via Solrj?
First, Daniel nailed the XY problem, but this isn't that... You're correct that hand-editing the schema file is error-prone. The managed schema API is your friend here. There are several commercial front-ends that already do this. The managed schema API is all just HTTP, so there's nothing precluding a Java program from interpreting a form and sending off the proper HTTP requests to modify the schema. The SolrJ client library has some sugar around this; there's no reason you can't use that as it's just a jar (and a dependency on a logging jar). For SolrCloud it's a little different. You need to make sure your changes get to Zookeeper, which the schema API will handle for you. One thing that's a bit confusing is "managed schema" and "schemaless". They both use the same underlying mechanism to modify the schema.xml file. With "managed schema" you do what you're talking about, have some process where you make specific modifications with the schema API to a controlled schema file. "schemaless" automatically tries to guess what the schema _should_ be and uses the managed schema API to implement those guesses. GW: Schema guessing is a great way to get things started, but virtually every organization I work with takes explicit control of the schema. They do this for three reasons: 1> the assumptions in managed schema create indexes that can be made much smaller by judicious options on the fields. 2> the search cases require careful analysis chains. 3> the guesses are wrong. I.e. if the first number encountered in a field is, say, 3 and the guessing says "Oh, this is an int field". The next doc is 3.4.. you'll get a parsing error and fail to index the doc. Best, Erick On Fri, Jan 8, 2016 at 7:38 AM, GW wrote: > Bob, > > Not sure why you would want to do this. You can set up Solr to guess the > schema. It creates a file called manage_schema.xml for an override. 
This is > the case with 5.3 I came across it by accident setting it up the first time > and I was a little annoyed but it made for a quick setup. Your programming > would still need to realise the new doc structure and use that new document > structure. The only problem is it's a bit generic in the guess work and I > did not spend much time testing it out so I am not really versed in > operating it. I got myself mack to schema.xml ASAP. My thoughts are you are > looking at a lot of work for little gain. > > Best, > > GW > > > > On 7 January 2016 at 21:36, Bob Lawson wrote: > >> I want to programmatically make changes to schema.xml using java to do >> it. Should I use Solrj to do this or is there a better way? Can I use >> Solrj to make the rest calls that make up the schema API? Whatever the >> answer, can anyone point me to an example showing how to do it? Thanks! >> >>
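Erick's Schema API route doesn't strictly require SolrJ; the same "add-field" command is plain JSON over HTTP. A sketch of the documented request shape (the endpoint, collection, field name, and type are hypothetical; the request is built but not sent here):

```python
import json
from urllib import request

# Hypothetical collection; the Schema API lives at /solr/<collection>/schema
SCHEMA_URL = "http://localhost:8983/solr/mycollection/schema"

# Documented Schema API command: add a stored date field
payload = {"add-field": {"name": "publish_date", "type": "tdate", "stored": True}}
body = json.dumps(payload).encode("utf-8")

req = request.Request(SCHEMA_URL, data=body,
                      headers={"Content-Type": "application/json"})
# request.urlopen(req)  # would apply the change on a live Solr node
```

In SolrCloud mode the same call also persists the change to ZooKeeper, which is the point Erick makes above.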
Re: solrcloud -How to delete a doc at a specific shard
This simply shouldn't be the case if by "duplicate" you mean it has the same id (i.e. the field defined as the uniqueKey in schema.xml). If you do have docs in different shards with the same ID, then something is very strange about your setup. What version of Solr BTW? Assuming you mean "same content but different IDs" then you can delete by ID either through SolrJ or on the URL .../collection/update?commit=true=idhere Best, Erick On Fri, Jan 8, 2016 at 12:52 AM, elvis鱼人 wrote: > my solrcloud has 3 shards and 2 replicas, > and one shard's docs are duplicated; the document router is compositeId. > Who can help me? > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/solrcloud-How-to-delete-a-doc-at-a-specific-shard-tp4249354.html > Sent from the Solr - User mailing list archive at Nabble.com.
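The delete-by-ID Erick describes can also be sent as a JSON update; the payload shape below is Solr's documented JSON "delete" command, while the collection name and document id are placeholders:

```python
import json
from urllib import request

# Hypothetical collection; commit=true makes the delete visible immediately
UPDATE_URL = "http://localhost:8983/solr/mycollection/update?commit=true"

# Documented JSON update syntax: delete one document by its uniqueKey
payload = {"delete": {"id": "duplicate-doc-id"}}
req = request.Request(UPDATE_URL, data=json.dumps(payload).encode("utf-8"),
                      headers={"Content-Type": "application/json"})
# request.urlopen(req)  # would execute the delete on a live cluster
```

With the compositeId router the uniqueKey itself determines the owning shard, so a delete by id should be routed to the correct shard automatically without targeting it explicitly.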
Kerberos ticket not renewing when storing index on Kerberized HDFS
Hello, I have Solr Cloud configured to stores its index files on a Kerberized HDFS (I followed documentation at https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS), and have been able to index some documents with the files being written to the HDFS as expected. However, it appears that some time after starting, Solr is unable to connect to HDFS as it no longer has a valid Kerberos TGT. The time-frame of this occurring is consistent with my default Kerberos ticket lifetime of 24 hours, so it appears as though Solr is not renewing its Kerberos ticket upon expiry. A restart of Solr resolves the issue again for 24 hours. Is there any configuration I can add to make Solr automatically renew its ticket or is this an issue with Solr? The following is the stack trace I am getting in Solr. java.io.IOException: Failed on local exception: java.io.IOException: Couldn't setup connection for solr/sandbox.hortonworks@hortonworks.com to sandbox.hortonworks.com/10.0.2.15:8020; Host Details : local host is: " sandbox.hortonworks.com/10.0.2.15"; destination host is: " sandbox.hortonworks.com":8020; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at org.apache.hadoop.ipc.Client.call(Client.java:1472) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy10.renewLease(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:571) at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at 
com.sun.proxy.$Proxy11.renewLease(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:879) at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442) at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Couldn't setup connection for solr/ sandbox.hortonworks@hortonworks.com to sandbox.hortonworks.com/10.0.2.15:8020 at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:672) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:643) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:730) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) at org.apache.hadoop.ipc.Client.call(Client.java:1438) ... 
16 more Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212) at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413) at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:553) at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:368) at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:722) at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:718) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:717) ... 19 more Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt) at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147) at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121) at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187) at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223) at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212) at
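One configuration avenue worth checking, from the same "Running Solr on HDFS" page cited above: point Solr's HDFS client at a keytab so it can log in (and re-login) by itself, rather than depending on an externally obtained ticket that expires after the default 24 hours. This is a solr.in.sh sketch; the keytab path and principal are examples for this sandbox:

```shell
# Kerberos settings documented for Solr-on-HDFS (solr.in.sh fragment)
SOLR_OPTS="$SOLR_OPTS \
  -Dsolr.hdfs.security.kerberos.enabled=true \
  -Dsolr.hdfs.security.kerberos.keytabfile=/etc/security/keytabs/solr.service.keytab \
  -Dsolr.hdfs.security.kerberos.principal=solr/sandbox.hortonworks.com@HORTONWORKS.COM"
```

If Solr was instead started against a ticket cache populated by a manual kinit, the observed 24-hour failure pattern is exactly what one would expect.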
Re: date difference faceting
I'm going to side-step your primary question and say that it's nearly always best to do your calculations up-front during indexing to make queries more efficient and thus serve more requests on the same hardware. This assumes that the stat you're interested in is predictable of course... Best, Erick On Fri, Jan 8, 2016 at 2:23 AM, David Santamauro wrote: > > Hi, > > I have two date fields, d_a and d_b, both of type solr.TrieDateField, that > represent different events associated with a particular document. The > interval between these dates is relevant for corner-case statistics. The > interval is calculated as the difference: sub(d_b,d_a) and I've been able to > > stats=true={!func}sub(d_b,d_a) > > What I ultimately would like to report is the interval represented as a > range, which could be seen as facet.query > > (pseudo code) > facet.query=sub(d_b,d_a)[ * TO 8640 ] // day > facet.query=sub(d_b,d_a)[ 8641 TO 60480 ] // week > facet.query=sub(d_b,d_a)[ 60481 TO 259200 ] // month > etc. > > Aside from actually indexing the difference in a separate field, is there > something obvious I'm missing? I'm on SOLR 5.2 in cloud mode. > > thanks > David
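The frange buckets David eventually settled on can be generated rather than hard-coded; the 36/72/180-day edges below reproduce his posted bounds, and d_a/d_b are the fields from his question. One caveat worth double-checking: ms() returns the difference in milliseconds, so the units of the bucket edges should be verified against the intended intervals:

```python
# Derive {!frange} facet.query parameters for date-difference buckets.
DAY = 24 * 60 * 60  # seconds in a day

# (lower, upper) bucket edges; 36*DAY == 3110400 matches the posted value
buckets = [(0, 36 * DAY), (36 * DAY + 1, 72 * DAY), (72 * DAY + 1, 180 * DAY)]
params = [f"facet.query={{!frange l={lo} u={hi}}}ms(d_b,d_a)"
          for lo, hi in buckets]
```

Generating the parameters this way keeps the bucket widths in one place instead of scattering magic numbers across the request.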
Re: SOLR replicas performance
On 1/8/2016 7:55 AM, Luca Quarello wrote: > I used solr5.3.1 and I sincerely expected response times with replica > configuration near to response times without replica configuration. > > Do you agree with me? > > I read here > http://lucene.472066.n3.nabble.com/Solr-Cloud-Query-Scaling-td4110516.html > that "Queries do not need to be routed to leaders; they can be handled by > any replica in a shard. Leaders are only needed for handling update > requests. " > > I haven't found this behaviour. In my case CONF2 e CONF3 have all replicas > on VM2 but analyzing core utilization during a request is 100% on both > machines. Why? Indexing is a little bit slower with replication -- the update must happen on all replicas. If your index is sharded (which I believe you did indicate in your initial message), you may find that all replicas get used even for queries. It is entirely possible that some of the shard subqueries will be processed on one replica and some of them will be processed on other replicas. I do not know if this commonly happens, but I would not be surprised if it does. If the machines are sized appropriately for the index, this separation should speed up queries, because you have the resources of multiple machines handling one query. That phrase "sized appropriately" is very important. Your initial message indicated that you have a 90GB index, and that you are running in virtual machines. Typically VMs have fairly small memory sizes. It is very possible that you simply don't have enough memory in the VM for good performance with an index that large. With 90GB of index data on one machine, I would hope for at least 64GB of RAM, and I would prefer to have 128GB. If there is more than 90GB of data on one machine, then even more memory would be needed. Thanks, Shawn
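Shawn's sizing rule of thumb can be turned into a rough calculator. Generalizing his stated numbers for a 90 GB index ("at least 64GB ... prefer 128GB") as linear ratios is my assumption, not an official formula:

```python
# Back-of-the-envelope RAM sizing for caching an on-disk Solr index,
# scaled from the 90 GB -> 64/128 GB figures in the message above.
def suggested_ram_gb(index_gb):
    minimum = round(index_gb * 64 / 90)    # "hope for at least"
    preferred = round(index_gb * 128 / 90)  # "would prefer"
    return minimum, preferred
```

For example, `suggested_ram_gb(90)` gives the thread's own (64, 128); per-machine index size, not total index size, is what matters here.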
Re: Solr search and index rate optimization
Here's a longer form of Toke's answer: https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/ BTW, on the surface, having 5 ZK nodes isn't doing you any real good. Zookeeper isn't really involved in serving queries or handling updates, its purpose is to have the state of the cluster (nodes up, recovering, down, etc) and notify Solr listeners when that state changes. There's no good reason to have 5 with a small cluster and by "small" I mean < 100s of nodes. Best, Erick On Fri, Jan 8, 2016 at 2:40 AM, Toke Eskildsen wrote: > On Fri, 2016-01-08 at 10:55 +0500, Zap Org wrote: >> i wanted to ask that i need to index after every 15 min with hard commit >> (real time records) and currently have 5 zookeeper instances and 2 solr >> instances in one machine serving 200 users with 32GB RAM. whereas i wanted >> to serve more than 10,000 users so what should be my machine specs and what >> should be my architecture for this much serve rate along with index rate. > > It depends on your system and if we were forced to guess, our guess > would be very loose. > > > Fortunately you do have a running system with real queries: Make a copy > on two similar machines (you will probably need more hardware anyway) > and simulate growing traffic, measuring response times at appropriate > points: 200 users, 500, 1000, 2000 etc. > > If you are very lucky, your current system scales all the way. If not, > you should have enough data to make an educated guess of the amount of > machines you need. You should have at least 3 measuring points to > extrapolate from as scaling is not always linear. > > - Toke Eskildsen, State and University Library, Denmark > >