Re: What is “high cardinality” in facet streams?
Right now we’re sharding the collection because we hit performance issues in the past with legacy Solr (i.e. a single Solr core), and we’re also experimenting a bit to see which replication factor we can get away with (in terms of resources and cost). Unfortunately, PSQL isn’t yet an option due to the lack of point field support, which we’re using in our schema (https://issues.apache.org/jira/browse/SOLR-10427). Thanks for pointing out the parallel function. What I don’t understand, though, is this: if I don’t use the parallel decorator, is my query not distributed across my cluster nodes (e.g. I have four shards and no replicas)?

> On 22 Feb 2018, at 03:01, Joel Bernstein wrote:
>
> With Streaming Expressions you have options for speeding up large
> aggregations.
>
> 1) Shard
> 2) Use the parallel function to run the aggregation in parallel.
> 3) Add more replicas
>
> When you use the parallel function the same aggregation can be pulled from
> every shard and every shard replica in the cluster.
>
> The parallel SQL interface supports a map_reduce aggregation mode where you
> can specify the number of parallel workers. If a SQL group-by query works
> for you, that might be the easiest way to go. The docs have good coverage of
> this topic.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Feb 21, 2018 at 8:43 PM, Shawn Heisey wrote:
>
>> On 2/21/2018 12:08 PM, Alfonso Muñoz-Pomer Fuentes wrote:
>>> Some more details about my collection:
>>> - Approximately 200M documents
>>> - 1.2M different values in the field I’m faceting over
>>>
>>> The query I’m doing is over a single bucket, which after applying q and
>>> fq reduces the 1.2M values to at most 60K (oftentimes half that value).
>>> From your replies I assume I’m not going to hit a bottleneck any time
>>> soon. Thanks a lot.
>>
>> Two hundred million documents is going to be a pretty big index even if
>> the documents are small. The server is going to need a lot of spare
>> memory (not assigned to programs) for good general performance.
>>
>> As I understand it, facet performance is going to be heavily determined
>> by the 1.2 million unique values in the field you're using. Facet
>> performance is probably going to be very similar whether your query
>> matches 60K or 1 million.
>>
>> Thanks,
>> Shawn

--
Alfonso Muñoz-Pomer Fuentes
Senior Lead Software Engineer @ Expression Atlas Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Tel: +44 (0) 1223 49 2633
Skype: amunozpomer
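For reference, a sketch of what wrapping a rollup in the parallel decorator can look like (the collection name, field name, and worker count below are invented, not taken from this thread):

parallel(logs,
         rollup(
           search(logs, q="*:*", fl="fieldA", sort="fieldA asc",
                  qt="/export", partitionKeys="fieldA"),
           over="fieldA",
           count(*)),
         workers="4",
         sort="fieldA asc")

As I read the docs, a non-parallel expression still streams tuples from every shard; it is the wrapping aggregation that runs in a single place until parallel partitions that work across workers (which is why partitionKeys is required on the underlying search).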
Re: What is “high cardinality” in facet streams?
All in all, the index is about 250GB, sharded across two dedicated VMs with 24GB of memory, and it’s performing OK so far (queries take about 7 seconds, the worst cases about 10). At some point in the past we needed to transition to SolrCloud because a single Solr core, of course, wouldn’t scale.

> On 22 Feb 2018, at 01:43, Shawn Heisey wrote:
>
> On 2/21/2018 12:08 PM, Alfonso Muñoz-Pomer Fuentes wrote:
>> Some more details about my collection:
>> - Approximately 200M documents
>> - 1.2M different values in the field I’m faceting over
>>
>> The query I’m doing is over a single bucket, which after applying q and
>> fq reduces the 1.2M values to at most 60K (oftentimes half that value).
>> From your replies I assume I’m not going to hit a bottleneck any time
>> soon. Thanks a lot.
>
> Two hundred million documents is going to be a pretty big index even if
> the documents are small. The server is going to need a lot of spare
> memory (not assigned to programs) for good general performance.
>
> As I understand it, facet performance is going to be heavily determined
> by the 1.2 million unique values in the field you're using. Facet
> performance is probably going to be very similar whether your query
> matches 60K or 1 million.
>
> Thanks,
> Shawn

--
Alfonso Muñoz-Pomer Fuentes
Senior Lead Software Engineer @ Expression Atlas Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Tel: +44 (0) 1223 49 2633
Skype: amunozpomer
AW: Sort by nested field but only in matching nested documents
Thanks for your answer, Mikhail.
Florian

-Original Message-
From: Mikhail Khludnev [mailto:m...@apache.org]
Sent: Tuesday, 6 February 2018 11:44
To: solr-user
Subject: Re: Sort by nested field but only in matching nested documents

Hello Florian,
No. As an alternative you can put it into the q param, suppressing scoring from undesired clauses with ^=0.

On Thu, Feb 1, 2018 at 5:22 PM, Florian Fankhauser wrote:
> Hello,
> given the following document structure (books as parent, libraries
> having these books as children):
>
> <add>
>   <doc>
>     <field name="doc_type_s">book</field>
>     <field name="id">1000</field>
>     <field name="title_t">Mr. Mercedes</field>
>     <field name="autor_t">Stephen King</field>
>     <doc>
>       <field name="doc_type_s">library</field>
>       <field name="id">1000/100</field>
>       <field name="acquisition_date_i">20160810</field>
>       <field name="city_t">Innsbruck</field>
>     </doc>
>     <doc>
>       <field name="doc_type_s">library</field>
>       <field name="id">1000/101</field>
>       <field name="acquisition_date_i">20180103</field>
>       <field name="city_t">Hall</field>
>     </doc>
>   </doc>
>   <doc>
>     <field name="doc_type_s">book</field>
>     <field name="id">1001</field>
>     <field name="title_t">Noah</field>
>     <field name="autor_t">Sebastian Fitzek</field>
>     <doc>
>       <field name="doc_type_s">library</field>
>       <field name="id">1001/100</field>
>       <field name="acquisition_date_i">20170810</field>
>       <field name="city_t">Innsbruck</field>
>     </doc>
>   </doc>
> </add>
>
> Now I want to get all books located in libraries in city "Innsbruck",
> sorted by acquisition date descending.
> In other words: I want to filter on the field city_t in the child
> documents, but return only the parent document. And I want to sort by
> the field acquisition_date_i in the child documents in descending
> order, newest first.
>
> My first try:
> ----------
>
> URL:
> http://localhost:8983/solr/test1/select?q=title_t:*&fq={!parent%20which=doc_type_s:book}city_t:Innsbruck&sort={!parent%20which=doc_type_s:book%20score=max%20v=%27%2Bdoc_type_s:library%20%2B{!func}acquisition_date_i%27}%20desc
>
> URL params decoded:
> q=title_t:*
> fq={!parent which=doc_type_s:book}city_t:Innsbruck
> sort={!parent which=doc_type_s:book score=max v='+doc_type_s:library +{!func}acquisition_date_i'} desc
>
> Result:
> {
>   "responseHeader":{
>     "status":0,
>     "QTime":4,
>     "params":{
>       "q":"title_t:*",
>       "fq":"{!parent which=doc_type_s:book}city_t:Innsbruck",
>       "sort":"{!parent which=doc_type_s:book score=max v='+doc_type_s:library +{!func}acquisition_date_i'} desc"}},
>   "response":{"numFound":2,"start":0,"docs":[
>     {
>       "doc_type_s":"book",
>       "text":["book",
>         "Mr. Mercedes",
>         "Stephen King"],
>       "id":"1000",
>       "title_t":"Mr. Mercedes",
>       "title_t_fac":"Mr. Mercedes",
>       "autor_t":"Stephen King",
>       "autor_t_fac":"Stephen King",
>       "_version_":1591205521252155392},
>     {
>       "doc_type_s":"book",
>       "text":["book",
>         "Noah",
>         "Sebastian Fitzek"],
>       "id":"1001",
>       "title_t":"Noah",
>       "title_t_fac":"Noah",
>       "autor_t":"Sebastian Fitzek",
>       "autor_t_fac":"Sebastian Fitzek",
>       "_version_":1591205521256349696}]
>   }}
>
> The result is wrong, because "Noah" should be before "Mr. Mercedes" in
> the list. The reason, I guess, is that "Mr. Mercedes" has another
> child document with a newer acquisition_date. But this child document
> is not in city "Innsbruck" and should not influence the sorting.
>
> So I tried to add the city filter to the sort parameter as well in my
> second try:
> ----------
>
> URL:
> http://localhost:8983/solr/test1/select?q=title_t:*&fq={!parent%20which=doc_type_s:book}city_t:Innsbruck&sort={!parent%20which=doc_type_s:book%20score=max%20v=%27%2Bdoc_type_s:library%20%2Bcity_t:Innsbruck%20%2B{!func}acquisition_date_i%27}%20desc
>
> URL params decoded:
> q=title_t:*
> fq={!parent which=doc_type_s:book}city_t:Innsbruck
> sort={!parent which=doc_type_s:book score=max v='+doc_type_s:library +city_t:Innsbruck +{!func}acquisition_date_i'} desc
>
> (I added "+city_t:Innsbruck" to the sort param)
>
> Result:
> {
>   "responseHeader":{
>     "status":0,
>     "QTime":3,
>     "params":{
>       "q":"title_t:*",
>       "fq":"{!parent which=doc_type_s:book}city_t:Innsbruck",
>       "sort":"{!parent which=doc_type_s:book score=max v='+doc_type_s:library +city_t:Innsbruck +{!func}acquisition_date_i'} desc"}},
>   "response":{"numFound":2,"start":0,"docs":[
>     {
>       "doc_type_s":"book",
>       "text":["book",
>         "Noah",
>         "Sebastian Fitzek"],
>       "id":"1001",
>       "title_t":"Noah",
>       "title_t_fac":"Noah",
>       "autor_t":"Sebastian Fitzek",
>       "autor_t_fac":"Sebastian Fitzek",
>       "_version_":1591205521256349696},
>     {
>       "doc_type_s":"book",
>       "text":["book",
>         "Mr. Mercedes",
>         "Stephen King"],
>       "id":"1000",
>       "title_t":"Mr. Mercedes",
>       "title_t_fac":"Mr. Mercedes",
>       "autor_t":"Stephen King",
>       "autor_t_fac":"Stephen King",
>       "_version_":1591205521252155392}]
>   }}
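A sketch of the q-param approach Mikhail suggests, using the field names from this thread (untested; the ^=0 constant-score operator keeps the city clause from contributing to the score, so score=max is driven only by the acquisition date):

q={!parent which=doc_type_s:book score=max v='+doc_type_s:library +(city_t:Innsbruck)^=0 +{!func}acquisition_date_i'}
sort=score desc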
Problems with DocExpirationUpdateProcessor with Secured SolrCloud
Hi,

We recently set up a 7.2.1 cloud with the intent of having documents automatically deleted from the collection using the DocExpirationUpdateProcessorFactory. We also have the cloud secured using the BasicAuthenticationPlugin. Our current config settings are below. The deployment is 3 nodes, each with a single Solr instance hosting a single replica for the collection. The collection itself has only 1 shard, so we have 3 copies (all NRT) of the same index.

What keeps happening is that the follower replicas end up being published in a down state by the leader replica on the first autoDelete pass, since it doesn't authenticate the distributed updates.

Relevant log dump: https://pastebin.com/ZtirJLSu

Is there something we were missing when we set this up? Besides the replicas going down, the processor works as expected on the leader replica.

Thanks,
Chris

+300SECONDS doc-expiration-processor-chain _expireat_ _ttl_ 300
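The XML config referred to above ("Our current config settings are below") was stripped by the archiver; only the values survived. A reconstruction of what such a chain typically looks like, with the surviving values slotted in (the processor layout is assumed from the Solr documentation, not recovered from the original mail):

<updateRequestProcessorChain default="true" name="doc-expiration-processor-chain">
  <processor class="solr.DefaultValueUpdateProcessorFactory">
    <str name="fieldName">_ttl_</str>
    <str name="value">+300SECONDS</str>
  </processor>
  <processor class="solr.DocExpirationUpdateProcessorFactory">
    <int name="autoDeletePeriodSeconds">300</int>
    <str name="ttlFieldName">_ttl_</str>
    <str name="expirationFieldName">_expireat_</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>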
Re: Limit search queries only to pull replicas
Hi,

The use case for this is that our indexing node has more shards than it has CPU cores; that is enough for indexing, but not enough to serve the search queries if those queries are heavy. To keep it out of serving requests we are using an in-house solution that routes the queries to pull replicas based on information from ZooKeeper.

Ere, thanks for the patch, looking forward to trying it.

Regards,
Stanislav

> On 14 Feb 2018, at 18:18, Ere Maijala wrote:
>
> I've now posted https://issues.apache.org/jira/browse/SOLR-11982 with a
> patch. It works just like preferLocalShards. SOLR-10880 is awesome, but my
> idea is not to filter out anything, so this just adjusts the order of nodes.
>
> --Ere
>
> Tomas Fernandez Lobbe wrote on 8.1.2018 at 21.42:
>> This feature is not currently supported. I was thinking of implementing it
>> by extending the work done in SOLR-10880. I still didn't have time to work
>> on it though. There is a patch for SOLR-10880 that doesn't implement
>> support for replica types, but could be used as a base.
>> Tomás
>>> On Jan 8, 2018, at 12:04 AM, Ere Maijala wrote:
>>>
>>> Server load alone doesn't always indicate the server's ability to serve
>>> queries. Memory and cache state are important too, and they're not as easy
>>> to monitor. Additionally, server load at any single point in time or a
>>> short-term average is not indicative of the server's ability to handle
>>> search requests if indexing happens in short but intense bursts.
>>>
>>> It can also complicate things if there is more than one Solr instance
>>> running on a single server.
>>>
>>> I'm definitely not against intelligent routing. In many cases it makes
>>> perfect sense, and I'd still like to use it, just limited to the pull
>>> replicas.
>>>
>>> --Ere
>>>
>>> Erick Erickson wrote on 5.1.2018 at 19.03:
>>>> Actually, I think a much better option is to route queries based on
>>>> server load. The theory of preferring pull replicas to leaders would be
>>>> that the leader will be doing the indexing work and the pull replicas
>>>> would be doing less work, therefore serving queries faster. But that's a
>>>> fragile assumption. Let's say indexing stops totally. Now your leader is
>>>> sitting there idle when it could be serving queries.
>>>>
>>>> The autoscaling work will allow for more intelligent routing: you can
>>>> monitor the CPU load on your servers and, if the leader has some spare
>>>> cycles, use them, vs. crudely routing all queries to pull replicas (or
>>>> tlog replicas for that matter).
>>>>
>>>> NOTE: I don't know whether this is being actively worked on or not, but
>>>> it seems a logical extension of the increased monitoring capabilities
>>>> being put in place for autoscaling. I'd rather see effort put in there
>>>> than support routing based solely on a node's type.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Fri, Jan 5, 2018 at 7:51 AM, Emir Arnautović <emir.arnauto...@sematext.com> wrote:
>>>>> It is interesting that ES had a similar feature to prefer primary/replica
>>>>> but is deprecating it and will remove it - I could not find an
>>>>> explanation why.
>>>>>
>>>>> Emir
>>>>> --
>>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>>>
>>>>>> On 5 Jan 2018, at 15:22, Ere Maijala wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> It would be really nice to have a server-side option, though. Not
>>>>>> everyone uses Solrj, and a typical fairly dumb client just queries the
>>>>>> server without any understanding about shards etc. Solr could be clever
>>>>>> enough to not forward the query to NRT shards when configured to prefer
>>>>>> PULL shards and they're available. Maybe it could be something similar
>>>>>> to the preferLocalShards parameter, like "preferShardTypes=TLOG,PULL".
>>>>>>
>>>>>> --Ere
>>>>>>
>>>>>> Emir Arnautović wrote on 14.12.2017 at 11.41:
>>>>>>> Hi Stanislav,
>>>>>>> I don't think that there is a built-in feature to do this, but that
>>>>>>> sounds like a nice feature for Solrj - maybe you should check if it's
>>>>>>> available. You can implement it outside of Solrj - check cluster state
>>>>>>> to see which shards are available and send queries only to pull
>>>>>>> replicas.
>>>>>>> HTH,
>>>>>>> Emir
>>>>>>> --
>>>>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>>>>>
>>>>>>>> On 14 Dec 2017, at 09:58, Stanislav Sandalnikov <s.sandalni...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We have a Solr 7.1 setup with SolrCloud where we have multiple shards
>>>>>>>> on one server (for indexing); each shard has a pull replica on other
>>>>>>>> servers.
>>>>>>>>
>>>>>>>> What are the possible ways to limit search requests only to pull-type
>>>>>>>> replicas?
>>>>>>>>
>>>>>>>> At the moment the only solution I found is to append the shards
>>>>>>>> parameter to each query, but if new shards are added later it
>>>>>>>> requires to change
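Until something like SOLR-11982 is available, the cluster-state approach Emir describes can be sketched in SolrJ roughly as below (Solr 7.x assumed, since Replica.Type is needed; the ZooKeeper address and collection name are placeholders):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class PullReplicaRouter {
  public static void main(String[] args) throws Exception {
    CloudSolrClient client = new CloudSolrClient.Builder().withZkHost("zk1:2181").build();
    client.connect();
    DocCollection coll = client.getZkStateReader().getClusterState().getCollection("mycollection");
    // Build a shards parameter naming only active PULL replicas:
    // one entry per shard, "|" separating alternative replicas within a shard.
    List<String> perShard = new ArrayList<>();
    for (Slice slice : coll.getActiveSlices()) {
      List<String> pulls = new ArrayList<>();
      for (Replica r : slice.getReplicas()) {
        if (r.getType() == Replica.Type.PULL && r.getState() == Replica.State.ACTIVE) {
          pulls.add(r.getCoreUrl());
        }
      }
      if (!pulls.isEmpty()) {
        perShard.add(String.join("|", pulls));
      }
    }
    SolrQuery q = new SolrQuery("*:*");
    q.set("shards", String.join(",", perShard)); // query only the pull replicas
    System.out.println(client.query("mycollection", q).getResults().getNumFound());
    client.close();
  }
}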
Response time under 1 second?
Hello

With a 3-node cluster (12GB each) and a corpus of 5GB (CSV format), is it better to completely disable the Solr caches? There is enough RAM for the entire index.

Is there a way to keep random queries under 1 second?

Thanks!
Re: Solr Swap space
On 2/21/2018 7:58 PM, Susheel Kumar wrote:
> Below output for the prod machine based on the steps you described. Please
> take a look. The Solr searches are returning fine and there is no issue with
> performance, but in the last 4 months swap space started going up. After a
> restart, it comes down to zero, and then in a few weeks its utilization
> reaches 40-50% and thus requires a restart of the Solr process.

I bet that if you run this command, it will show you a value of 60:

cat /proc/sys/vm/swappiness

This makes the OS very aggressive about using swap, even when there is absolutely no need for it to do so.

If you type the following series of commands, it should fix the problem and prevent it from happening again until you reboot the system:

echo "0" > /proc/sys/vm/swappiness
swapoff -a
swapon -a

Note that when the swapoff command runs, it will force the OS to read all the swapped data back into memory. It will take several minutes for this to occur, because it must read nearly a gigabyte of data and figure out how to put it back in memory. Both of the command outputs you included say that there is over 20GB of free memory, so I do not anticipate the system having problems from running these commands. It will slow the machine down temporarily, though -- so only do it during a quiet time for your Solr install.

To make this setting survive a reboot, find the sysctl.conf file somewhere in your /etc directory and add this line to it:

vm.swappiness = 0

This setting does not completely disable swap. If the system finds itself with real memory pressure and actually does NEED to use swap, it still will ... it just won't swap anything out before it's actually required.

I do not think the behavior you are seeing is actually causing problems, based on your system load and CPU usage. But what I've shared should fix it for you.

Thanks,
Shawn
Re: Issue Using JSON Facet API Buckets in Solr 6.6
Thanks Antelmo, I'm trying to reproduce this now. -Yonik On Mon, Feb 19, 2018 at 10:13 AM, Antelmo Aguilar wrote: > Hi all, > > I was wondering if the information I sent is sufficient to look into the > issue. Let me know if you need anything else from me please. > > Thanks, > Antelmo > > On Thu, Feb 15, 2018 at 1:56 PM, Antelmo Aguilar wrote: > >> Hi, >> >> Here are two pastebins. The first is the full complete response with the >> search parameters used. The second is the stack trace from the logs: >> >> https://pastebin.com/rsHvKK63 >> >> https://pastebin.com/8amxacAj >> >> I am not using any custom code or plugins with the Solr instance. >> >> Please let me know if you need anything else and thanks for looking into >> this. >> >> -Antelmo >> >> On Wed, Feb 14, 2018 at 12:56 PM, Yonik Seeley wrote: >> >>> Could you provide the full stack trace containing "Invalid Date >>> String" and the full request that causes it? >>> Are you using any custom code/plugins in Solr? >>> -Yonik >>> >>> >>> On Mon, Feb 12, 2018 at 4:55 PM, Antelmo Aguilar wrote: >>> > Hi, >>> > >>> > I was using the following part of a query to get facet buckets so that I >>> > can use the information in the buckets for some post-processing: >>> > >>> > "json": >>> > "{\"filter\":[\"bundle:pop_sample\",\"has_abundance_data_b: >>> true\",\"has_geodata:true\",\"${project}\"],\"facet\":{\"ter >>> m\":{\"type\":\"terms\",\"limit\":-1,\"field\":\"${term:spec >>> ies_category}\",\"facet\":{\"collection_dates\":{\"type\":\ >>> "terms\",\"limit\":-1,\"field\":\"collection_date\",\"facet\ >>> ":{\"collection\": >>> > {\"type\":\"terms\",\"field\":\"collection_assay_id_s\",\"fa >>> cet\":{\"abnd\":\"sum(div(sample_size_i, >>> > collection_duration_days_i))\"" >>> > >>> > Sorry if it is hard to read. Basically what is was doing was getting >>> the >>> > following buckets: >>> > >>> > First bucket will be categorized by "Species category" by default >>> unless we >>> > pass in the request the "term" parameter which we will categories the >>> first >>> > bucket by whatever "term" is set to. Then inside this first bucket, we >>> > create another buckets of the "Collection date" category. Then inside >>> the >>> > "Collection date" category buckets, we would use some functions to do >>> some >>> > calculations and return those calculations inside the "Collection date" >>> > category buckets. >>> > >>> > This query is working fine in Solr 6.2, but I upgraded our instance of >>> Solr >>> > 6.2 to the latest 6.6 version. However it seems that upgrading to Solr >>> 6.6 >>> > broke the above query. Now it complains when trying to create the >>> buckets >>> > of the "Collection date" category. I get the following error: >>> > >>> > Invalid Date String:'Fri Aug 01 00:00:00 UTC 2014' >>> > >>> > It seems that when creating the buckets of a date field, it does some >>> > conversion of the way the date is stored and causes the error to appear. >>> > Does anyone have an idea as to why this error is happening? I would >>> really >>> > appreciate any help. Hopefully I was able to explain my issue well. >>> > >>> > Thanks, >>> > Antelmo >>> >> >>
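For anyone reading along, the escaped facet JSON quoted above, unescaped and re-indented (a purely mechanical reconstruction; the archive's line-wrapping may have split a token or two):

{
  "filter": [
    "bundle:pop_sample",
    "has_abundance_data_b:true",
    "has_geodata:true",
    "${project}"
  ],
  "facet": {
    "term": {
      "type": "terms",
      "limit": -1,
      "field": "${term:species_category}",
      "facet": {
        "collection_dates": {
          "type": "terms",
          "limit": -1,
          "field": "collection_date",
          "facet": {
            "collection": {
              "type": "terms",
              "field": "collection_assay_id_s",
              "facet": {
                "abnd": "sum(div(sample_size_i, collection_duration_days_i))"
              }
            }
          }
        }
      }
    }
  }
}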
Re: Response time under 1 second?
On 2/22/2018 8:53 AM, LOPEZ-CORTES Mariano-ext wrote:
> With a 3-node cluster (12GB each) and a corpus of 5GB (CSV format),
> is it better to completely disable the Solr caches? There is enough RAM
> for the entire index.

The size of the input data will have an effect on how big the index is, but it is not a direct indication of the index size. The size of the index is more important than the size of the data that you send to Solr to create the index.

You say 12GB ... but is this total system memory, or the max Java heap size for Solr? What are these two numbers for your servers?

If you go to the admin UI for one of these servers and look at the Overview page for all of the index cores it contains, you will be able to see how many documents and what size each index is on disk. What are these numbers? If the numbers are similar for all the servers, then I will only need to see it for one of them.

If the machine is running an OS like Linux that has the gnu top program, then I can see a lot of useful information from that program. Run "top" (not htop or other variants), press shift-M to sort the list by memory, and grab a screenshot. This will probably be an image file, so you'll need to find a file sharing site and give us a URL to access the file. Attachments rarely make it to the mailing list.

Thanks,
Shawn
SOLR Score Range Changed
I am migrating from SOLR 4.10.2 to SOLR 7.1. All seems to be going well, except for one thing: the scores coming back for the resulting documents are different.

The core uses a schema. Here's the schema info for the field that I am searching on:

When searching maxrows=750, fields: *,score

IDX_Company:(cat and scratch)
SOLR 7.1: max score 6.95 and a min of 6.28
SOLR 4.10.2: max score 8.63 and a min of 0.91

IDX_InsuredName:(cat and scratch and fever)
SOLR 7.1: max score of 12.99 and a min of 11.25
SOLR 4.10.2: max of 3.97 and min of 0.77

See how the range of values is different (ranges in 7.1 don't go down to 0.x). Also notice that the max score doubles when I add one word to the search terms in 7.1. Most important, the ranges in 4.10.2 overlap - but the 7.1 ranges don't.

A little more information to show you how I use this information, and why this is causing a problem. I get a company name like "bobs cabinetry" and another "all american tech enterprise". I run two SOLR queries per company name; I'll call them 1-AND, 1-OR, 2-AND, 2-OR.

IDX_Company:(bobs AND cabinetry) &f=*,score,requestid:"1-AND"
IDX_Company:(bobs OR cabinetry) &f=*,score,requestid:"1-OR"
IDX_Company:(all AND american AND tech AND enterprise) &f=*,score,requestid:"2-AND"
IDX_Company:(all OR american OR tech OR enterprise) &f=*,score,requestid:"2-OR"

I combine the results together, sort by descending score, and then take the top 750 rows. (The requestid lets me know which query the results came from.)

Because of the changes in the range of scores, the sort pushes all of the "all american tech enterprise" rows to the top of the results (because of no overlap), and when the top 750 are taken, everything for "bobs cabinetry" is removed from the results.

Is there some config setting I can change to make score calculation act like it did in 4.10.2? Or something else?
Re: Solr Swap space
Cool, thanks, Shawn. I was also looking at the swappiness and it is set to 60. Will try this out and let you know. Thanks again.

On Thu, Feb 22, 2018 at 10:55 AM, Shawn Heisey wrote:

> On 2/21/2018 7:58 PM, Susheel Kumar wrote:
>
>> Below output for the prod machine based on the steps you described. Please
>> take a look. The Solr searches are returning fine and there is no issue with
>> performance, but in the last 4 months swap space started going up. After a
>> restart, it comes down to zero, and then in a few weeks its utilization
>> reaches 40-50% and thus requires a restart of the Solr process.
>
> I bet that if you run this command, it will show you a value of 60:
>
> cat /proc/sys/vm/swappiness
>
> This makes the OS very aggressive about using swap, even when there is
> absolutely no need for it to do so.
>
> If you type the following series of commands, it should fix the problem
> and prevent it from happening again until you reboot the system:
>
> echo "0" > /proc/sys/vm/swappiness
> swapoff -a
> swapon -a
>
> Note that when the swapoff command runs, it will force the OS to read all
> the swapped data back into memory. It will take several minutes for this
> to occur, because it must read nearly a gigabyte of data and figure out how
> to put it back in memory. Both of the command outputs you included say that
> there is over 20GB of free memory, so I do not anticipate the system
> having problems from running these commands. It will slow the machine down
> temporarily, though -- so only do it during a quiet time for your Solr
> install.
>
> To make this setting survive a reboot, find the sysctl.conf file somewhere
> in your /etc directory and add this line to it:
>
> vm.swappiness = 0
>
> This setting does not completely disable swap. If the system finds itself
> with real memory pressure and actually does NEED to use swap, it still will
> ... it just won't swap anything out before it's actually required.
>
> I do not think the behavior you are seeing is actually causing problems,
> based on your system load and CPU usage. But what I've shared should fix
> it for you.
>
> Thanks,
> Shawn
Re: SOLR Score Range Changed
On 2/22/2018 9:50 AM, Hodder, Rick wrote:
> I am migrating from SOLR 4.10.2 to SOLR 7.1. All seems to be going well,
> except for one thing: the scores coming back for the resulting documents
> are different.

The absolute score has no meaning when you change something -- the index, the query, the software version, etc. You can't compare absolute scores. What matters is the relative score of one document to another *in the same query*. The amount of difference is almost irrelevant -- the goal of Lucene's score calculation gymnastics is to have one document score higher than another, so the *order* is reasonably correct.

Assuming you're using the default relevancy sort, does the order of your search results change dramatically from one version to the other? If it does, is the order generally better from a relevance standpoint, or generally worse? If you are specifying an explicit sort, then the scores will likely be ignored.

What I am describing is also why it's strongly recommended that you never try to convert scores to percentages:

https://wiki.apache.org/lucene-java/ScoresAsPercentages

Thanks,
Shawn
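Not from the thread, but one common workaround for the merge step described above: since raw scores from different queries aren't comparable, interleave the per-query result lists by rank position instead of sorting the union by score. A minimal sketch (the element type and list names are placeholders, and it does not deduplicate):

import java.util.ArrayList;
import java.util.List;

public class RankMerge {
  // Interleave two ranked lists by position; rank, not raw score,
  // decides the combined order, so differing score ranges cannot
  // starve one query's results.
  static <T> List<T> mergeByRank(List<T> a, List<T> b, int limit) {
    List<T> merged = new ArrayList<>();
    int i = 0, j = 0;
    while (merged.size() < limit && (i < a.size() || j < b.size())) {
      if (i < a.size()) merged.add(a.get(i++));
      if (merged.size() < limit && j < b.size()) merged.add(b.get(j++));
    }
    return merged;
  }
}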
RE: Response time under 1 second?
For the moment, I have the following information:

12GB is the max Java heap. Total memory I don't know (no direct access to the host).

2 replicas:
Size 1 = 11.51 GB
Size 2 = 11.82 GB

(Sizes shown in the Core Overview admin GUI.)

Thanks very much!

-Original Message-
From: Shawn Heisey [mailto:elyog...@elyograg.org]
Sent: Thursday, 22 February 2018 17:06
To: solr-user@lucene.apache.org
Subject: Re: Response time under 1 second?

On 2/22/2018 8:53 AM, LOPEZ-CORTES Mariano-ext wrote:
> With a 3-node cluster (12GB each) and a corpus of 5GB (CSV format),
> is it better to completely disable the Solr caches? There is enough RAM
> for the entire index.

The size of the input data will have an effect on how big the index is, but it is not a direct indication of the index size. The size of the index is more important than the size of the data that you send to Solr to create the index.

You say 12GB ... but is this total system memory, or the max Java heap size for Solr? What are these two numbers for your servers?

If you go to the admin UI for one of these servers and look at the Overview page for all of the index cores it contains, you will be able to see how many documents and what size each index is on disk. What are these numbers? If the numbers are similar for all the servers, then I will only need to see it for one of them.

If the machine is running an OS like Linux that has the gnu top program, then I can see a lot of useful information from that program. Run "top" (not htop or other variants), press shift-M to sort the list by memory, and grab a screenshot. This will probably be an image file, so you'll need to find a file sharing site and give us a URL to access the file. Attachments rarely make it to the mailing list.

Thanks,
Shawn
Re: Response time under 1 second?
On 2/22/2018 10:45 AM, LOPEZ-CORTES Mariano-ext wrote:
> For the moment, I have the following information:
>
> 12GB is the max Java heap. Total memory I don't know (no direct access
> to the host).
>
> 2 replicas:
> Size 1 = 11.51 GB
> Size 2 = 11.82 GB
>
> (Sizes shown in the Core Overview admin GUI.)

OK, so you have about 23GB of total index data on the machine. With a 12GB heap, and assuming there's no other software running on the machine, then for good performance I would want to have at least 32GB total memory, which leaves around 20GB for the OS to cache the 23GB index. More memory would be better, but probably isn't a requirement. If there is other software running on the machine, then that will increase the total memory requirement.

It is always possible that your Solr install is in a situation where 12GB of heap is actually not quite big enough. If that happens, performance will usually be a lot worse than in situations where the left-over memory is not enough for the OS to cache the index properly. You might be able to get decent performance if the total memory is about 24GB, but that much might NOT be enough. There are a lot of factors affecting actual memory requirements.

The Solr admin UI will tell you what the total physical memory in the system is, on the dashboard. It will be the upper right graph. Note that this graph is likely to show 100% or nearly 100% full. Don't let this alarm you -- it's normal.

How did you arrive at the 12GB size for your heap? Have you tried reducing this number so that there is more memory left for the OS to handle disk caching? I have no idea whether your Solr install will still work properly with a smaller heap, so be aware that reducing the heap might cause more problems.

https://wiki.apache.org/solr/SolrPerformanceProblems#RAM

Thanks,
Shawn
Turn on/off query based on a url parameter
Hi,

I want to enable or disable a SolrFeature in LTR based on an efi parameter. In simple terms, the query should be executed only if a parameter is true.

Any examples or suggestions on how to accomplish this?

Function query examples use fields to supply a value. In my case I want to execute the query only if a URL parameter is true.

Thanks,
Roopa
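For illustration only, one shape this sometimes takes (a sketch, not a verified recipe): wrap the feature's query in an if() function query and drive the condition from an efi value via the ${...} substitution LTR performs on feature params. The feature name, the inner query, and the useFeature parameter below are all invented, and whether your version supports a ${name:default} fallback should be checked:

{
  "name": "toggledFeature",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params": {
    "q": "{!func}if(${useFeature:0}, query({!v='category:electronics'}), 0)"
  }
}

At request time the toggle would then be passed as efi.useFeature=1 inside the rq={!ltr ...} parameter.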
RE: Turn on/off query based on a url parameter
I always filter Solr requests via a proxy (so Solr itself is not exposed directly to the web). In that proxy, the query parameters can be broken down and filtered as desired (I examine authorities granted to a session to control even which indexes are being searched) before passing the modified URL to Solr. The coding of the proxy obviously depends on your application environment. We use Java and Spring.

-Original Message-
From: Roopa Rao [mailto:roop...@gmail.com]
Sent: Friday, 23 February 2018 8:04 a.m.
To: solr-user@lucene.apache.org
Subject: Turn on/off query based on a url parameter

Hi,

I want to enable or disable a SolrFeature in LTR based on an efi parameter. In simple terms, the query should be executed only if a parameter is true.

Any examples or suggestions on how to accomplish this?

Function query examples use fields to supply a value. In my case I want to execute the query only if a URL parameter is true.

Thanks,
Roopa
Re: Solr Autoscaling multi-AZ rules
I managed to miss this reply earlier, but: Shard: A logical segment of a collection Replica: A physical core, representing a particular Shard Replication Factor (RF): A set of Replicas, such that a single Replica exists for each Shard in a Collection. Availability Zone (AZ): A partitioned set of nodes such that a physical or hardware failure in one AZ should not affect another AZ. AZ could mean distinct racks in a data center, or distinct data centers, but I happen to specifically mean the AWS definition here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-regions-availability-zones So an RF2 collection with 2 shards means I have four Replicas in my collection, two shard1 and two shard2. If it's RF3, then I have six: three shard1 and three shard2. I'm using "Distinct RF" as a shorthand for "a single replica for every shard in the collection". In the RF2 example above, if I have two Availability Zones, I would want a Distinct RF in each AZ. So, a replica for shard1 and shard2 in AZ1, and a replica for shard1 and shard2 in AZ2. I would *not* want, say, both shard1 replicas in AZ1 because then a failure of AZ1 could leave me with no replicas for shard1 and an incomplete collection. If I had RF6 and two AZs, I would want three Distinct RFs in each AZ. (three replicas for each shard, per AZ) I understand that {"replica": "<7", "node":"#ANY"} may result in two replicas of the same shard ending up on the same node. However, the other rule should prevent this: {"replica": "<2", "shard": "#EACH", "node": "#ANY"} So by using both rules, that should mean "no more than six replicas on a node, where all the replicas on that node represent distinct shards". Right? On 2/12/18, 12:18 PM, "Noble Paul" wrote: >>Goal: No node should have more than 6 shards This is not possible today {"replica": "<7", "node":"#ANY"} , means don't put more than 7 replicas of the collection (irrespective of the shards) in a given node what do you mean by distinct 'RF' ? I think we are screwing up the terminologies a bit here On Wed, Feb 7, 2018 at 1:38 PM, Jeff Wartes wrote: > I’ve been messing around with the Solr 7.2 autoscaling framework this week. Some things seem trivial, but I’m also running into questions and issues. If anyone else has experience with this stuff, I’d be glad to hear it. Specifically: > > > Context: > -One collection, consisting of 42 shards, where up to 6 shards can fit on a single node. (which means 7 nodes per Replication Factor) > -Three AZs, each with its own ip_2 value. > > Goals: > > Goal: Fully utilize available nodes. > Cluster Preference: {“maximize”: "cores”} > > Goal: No node should have more than one replica of a given shard > Rule: {"replica": "<2", "shard": "#EACH", "node": "#ANY"} > > Goal: No node should have more than 6 shards > Rule: {"replica": "<7", "node":"#ANY"} > > Goal: Where possible, distinct RFs should each exist in an AZ. > (Example1: I’d like 7 nodes with a complete RF in AZ 1 and 7 nodes with a complete RF in AZ 2, and not end up with, say, both shard2 replicas in AZ 1) > (Example2: If I have 14 nodes in AZ 1 and 7 in AZ 2, I should have two full RFs in AZ 1 and one in AZ 2) > Rule: ??? > > I could have multiple non-strict rules perhaps? 
Like: > {"replica": "<2", "shard": "#EACH", "ip_2": "1", "strict":false} > {"replica": "<3", "shard": "#EACH", "ip_2": "1", "strict":false} > {"replica": "<4", "shard": "#EACH", "ip_2": "1", "strict":false} > {"replica": "<2", "shard": "#EACH", "ip_2": "2", "strict":false} > {"replica": "<3", "shard": "#EACH", "ip_2": "2", "strict":false} > {"replica": "<4", "shard": "#EACH", "ip_2": "2", "strict":false} > etc > So having more than one RF in an AZ is a technical “violation”, but if placement minimizes non-strict violations, replicas would tend to get placed correctly. > > > Given a working set of rules, I’m still having trouble with two things: > > 1. I’ve manually created the “.system” collection, as it didn’t seem to get created automatically. However, autoscaling activity is not getting logged to it. > 2. I can’t seem to figure out how to scale up. > * I’d presumed editing the collection’s “replicationFactor” would do the trick, but it does not. > * The “node-up” trigger will serve to replace lost replicas, but won’t otherwise take advantage of additional capacity. > >i. There’s a UTILIZENODE command in 7.2, but it appears that’s still something you need to trigger manually. > > Anyone played with this stuff? -- - Noble Paul
Re: Deploying solr to tomcat 7
Dear Shawn,

Thanks a lot for the quick response. I will check on the same.

Thanks & Regards,
Fazulur Rehaman

On Wed, Feb 21, 2018 at 4:55 PM, Shawn Heisey wrote:
> On 2/21/2018 3:00 AM, Rehaman wrote:
>> We installed an Ensembl server in our environment and are not able to
>> query databases with a large number of entries. For that purpose we need
>> to use indexed databases through the Solr search engine.
>>
>> We have installed the Solr search engine (Solr Specification Version:
>> 3.6.1) on Tomcat 7. We are able to see the Solr main page "Welcome to
>> Solr" with the Ensembl shards.
>
> I don't know anything about Ensembl. But I can comment about Solr.
> Version 3.6.1 is nearly six years old, and is four major versions out of
> date, as version 7.2.1 is the current release. I can attempt to help, but
> this version is so old that it's effectively end of life.
>
>> When I try to query each shard I am getting an "HTTP status 500" error.
>> I have searched the forum for this and have not been able to resolve it.
>> Please find the attached error log.
>
> This is the relevant line from the log that indicates the problem:
>
> Caused by: java.net.ConnectException: Connection refused (Connection refused)
>
> The Solr server is trying to access one of the URL endpoints mentioned in
> the "shards" parameter. That connection is being refused. Which means
> that either the traffic is being blocked, possibly by a firewall, or the
> URL endpoint in the shards parameter is not correct.
>
> Thanks,
> Shawn
Re: Issue Using JSON Facet API Buckets in Solr 6.6
I've reproduced the issue and opened https://issues.apache.org/jira/browse/SOLR-12020 -Yonik On Thu, Feb 22, 2018 at 11:03 AM, Yonik Seeley wrote: > Thanks Antelmo, I'm trying to reproduce this now. > -Yonik > > > On Mon, Feb 19, 2018 at 10:13 AM, Antelmo Aguilar wrote: >> Hi all, >> >> I was wondering if the information I sent is sufficient to look into the >> issue. Let me know if you need anything else from me please. >> >> Thanks, >> Antelmo >> >> On Thu, Feb 15, 2018 at 1:56 PM, Antelmo Aguilar wrote: >> >>> Hi, >>> >>> Here are two pastebins. The first is the full complete response with the >>> search parameters used. The second is the stack trace from the logs: >>> >>> https://pastebin.com/rsHvKK63 >>> >>> https://pastebin.com/8amxacAj >>> >>> I am not using any custom code or plugins with the Solr instance. >>> >>> Please let me know if you need anything else and thanks for looking into >>> this. >>> >>> -Antelmo >>> >>> On Wed, Feb 14, 2018 at 12:56 PM, Yonik Seeley wrote: >>> Could you provide the full stack trace containing "Invalid Date String" and the full request that causes it? Are you using any custom code/plugins in Solr? -Yonik On Mon, Feb 12, 2018 at 4:55 PM, Antelmo Aguilar wrote: > Hi, > > I was using the following part of a query to get facet buckets so that I > can use the information in the buckets for some post-processing: > > "json": > "{\"filter\":[\"bundle:pop_sample\",\"has_abundance_data_b: true\",\"has_geodata:true\",\"${project}\"],\"facet\":{\"ter m\":{\"type\":\"terms\",\"limit\":-1,\"field\":\"${term:spec ies_category}\",\"facet\":{\"collection_dates\":{\"type\":\ "terms\",\"limit\":-1,\"field\":\"collection_date\",\"facet\ ":{\"collection\": > {\"type\":\"terms\",\"field\":\"collection_assay_id_s\",\"fa cet\":{\"abnd\":\"sum(div(sample_size_i, > collection_duration_days_i))\"" > > Sorry if it is hard to read. Basically what is was doing was getting the > following buckets: > > First bucket will be categorized by "Species category" by default unless we > pass in the request the "term" parameter which we will categories the first > bucket by whatever "term" is set to. Then inside this first bucket, we > create another buckets of the "Collection date" category. Then inside the > "Collection date" category buckets, we would use some functions to do some > calculations and return those calculations inside the "Collection date" > category buckets. > > This query is working fine in Solr 6.2, but I upgraded our instance of Solr > 6.2 to the latest 6.6 version. However it seems that upgrading to Solr 6.6 > broke the above query. Now it complains when trying to create the buckets > of the "Collection date" category. I get the following error: > > Invalid Date String:'Fri Aug 01 00:00:00 UTC 2014' > > It seems that when creating the buckets of a date field, it does some > conversion of the way the date is stored and causes the error to appear. > Does anyone have an idea as to why this error is happening? I would really > appreciate any help. Hopefully I was able to explain my issue well. > > Thanks, > Antelmo >>> >>>
Indexing timeout issues with SolrCloud 7.1
I'm trying to debug why indexing in SolrCloud 7.1 is having so many issues. It will hang most of the time, and time out the rest. Here's an example:

time curl -s 'myhost:8080/solr/mycollection/update/json/docs' -d '{"solr_id":"test_001", "data_type":"test"}'|jq .
{
  "responseHeader": {
    "status": 0,
    "QTime": 5004
  }
}
curl -s 'myhost:8080/solr/mycollection/update/json/docs' -d  0.00s user 0.00s system 0% cpu 5.025 total
jq .  0.01s user 0.00s system 0% cpu 5.025 total

Here are some of the timeout errors I'm seeing:

2018-02-23 03:55:02.903 ERROR (qtp1595212853-3607) [c:mycollection s:shard1 r:core_node12 x:mycollection_shard1_replica_n11] o.a.s.h.RequestHandlerBase java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 12/12 ms
2018-02-23 03:55:02.903 ERROR (qtp1595212853-3607) [c:mycollection s:shard1 r:core_node12 x:mycollection_shard1_replica_n11] o.a.s.s.HttpSolrCall null:java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 12/12 ms
2018-02-23 03:55:36.517 ERROR (recoveryExecutor-3-thread-4-processing-n:solr2-d.myhost:8080_solr x:mycollection_shard1_replica_n11 s:shard1 c:mycollection r:core_node12) [c:mycollection s:shard1 r:core_node12 x:mycollection_shard1_replica_n11] o.a.s.h.ReplicationHandler Index fetch failed :org.apache.solr.common.SolrException: Index fetch failed :
2018-02-23 03:55:36.517 ERROR (recoveryExecutor-3-thread-4-processing-n:solr2-d.myhost:8080_solr x:mycollection_shard1_replica_n11 s:shard1 c:mycollection r:core_node12) [c:mycollection s:shard1 r:core_node12 x:mycollection_shard1_replica_n11] o.a.s.c.RecoveryStrategy Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed.

We currently have two separate Solr clusters: our current in-production cluster, which runs on Solr 3.4, and a new ring that I'm trying to bring up, which runs on SolrCloud 7.1. I have the exact same code indexing to both clusters. The Solr 3.4 cluster indexes fine, but I'm running into lots of issues with SolrCloud 7.1.

Some additional details about the setup:
* 5 nodes, solr2-a through solr2-e
* 5 replicas
* 1 shard
* The servers have 48G of RAM with -Xmx and -Xms set to 16G
* I currently have soft commits at 10m intervals and hard commits (with openSearcher=false) at 1m intervals. I also tried 5m (soft) and 15s (hard) as well.

Any help or pointers would be greatly appreciated. Thanks!
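For reference, a solrconfig.xml sketch matching the commit intervals described above (1-minute hard commits without opening a searcher, 10-minute soft commits); the values come from the prose, the element layout is the standard one:

<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>600000</maxTime>
</autoSoftCommit>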
Solrj SolrServer not converting the Collection of Pojo Objects inside Parent Pojo
We are using Solrj version 4.10.4 as the Java client to add documents into Solr version 1.4.1.

Sample POJO objects:

@SolrDocument(solrCoreName="customer")
public class Customer {
    private String customerId;
    private String customerName;
    private int age;
    private List<Address> addresses;
    // getters and setters
}

public class Address {
    private String street;
    private String city;
    private String state;
    private String country;
    private Long zip;
    // getters and setters
}

When indexing the Customer document with the below schema, the Customer document that gets indexed in Solr has the Address objects' memory references (Address@spjdspf13, Address@sdf535) as an arr of elements instead of the individual fields of Address.

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
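SolrJ's DocumentObjectBinder does not descend into arbitrary nested POJOs; an unannotated complex type appears to be written via its toString(), which would explain the Address@... values above. Also note that Solr 1.4.1 predates child-document support, so flattening is about the only option. A workaround sketch that builds the document by hand (the "addresses" field name and the delimiter are invented):

import org.apache.solr.common.SolrInputDocument;

// customer is a populated Customer instance; solrServer is an
// already-configured SolrServer pointing at the "customer" core.
SolrInputDocument doc = new SolrInputDocument();
doc.addField("customerId", customer.getCustomerId());
doc.addField("customerName", customer.getCustomerName());
doc.addField("age", customer.getAge());
for (Address a : customer.getAddresses()) {
  // Flatten each Address into one delimited string in a multi-valued field.
  doc.addField("addresses",
      a.getStreet() + "|" + a.getCity() + "|" + a.getState()
          + "|" + a.getCountry() + "|" + a.getZip());
}
solrServer.add(doc);
solrServer.commit();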
Solr not accessible - javax.net.ssl.SSLException
Greetings, Apache Solr community. I'm here to ask for your help and advice about a Solr-related problem I'm having.

My company is an e-commerce website and uses Solr in production for querying items in our inventory. The Solr installation was done by an engineer who has left the company.

About 2 weeks ago, Solr stopped working completely (our website wasn't rendering completely and we lost the search functionality). We also couldn't access the Solr dashboard, located on our server at https://api.ishippo.com:8282/solr/# (NB - Solr runs on port 8282 on our server.)

I logged onto the remote server where Solr was installed and ran

bin/solr status

I got this message:

Found 1 Solr nodes:
Solr process 4365 running on port 8282
ERROR: Failed to get system information from https://localhost:8282/solr due to: javax.net.ssl.SSLException: Certificate for <localhost> doesn't match any of the subject alternative names: [*.ishippo.com, ishippo.com]

We figured that it could be an SSL issue and tried accessing the Solr dashboard through plain HTTP by plugging in our server's IP address. This time, we could access the Solr dashboard. But our website works solely over HTTPS, so the Solr query gets blocked every time.

It seems that only HTTPS connections are being blocked by Solr on its port (8282). Everything works fine on the other ports, and over HTTP.

We contacted our SSL certificate authority, and they said everything was fine from their end. They even made us perform openssl tests and send them the output, but they couldn't find any cause from their end. (I have the openssl messages returned from the tests, which are long. I can share them if someone needs them.)

What could be the issue here? I have tried so many things to fix this, to no avail. Does anybody know what's going on and can help a user out?

Thank you for your patience,
iShippo

Here is a summary of the problem:
- The Solr dashboard (located at https://api.ishippo.com:8282/solr/#) is not accessible.
- Only port 8282 (which Solr runs on) is affected. Services also running on api.ishippo.com on other ports are running fine.
- Solr throws a javax.net.ssl.SSLException error.
- We discovered we are able to access the Solr dashboard by looking up the IP address of our server (and not the URL) over HTTP (http://52.66.65.108:8282/solr/#).
- Our platform runs solely on HTTPS, so we're not able to work around it by using HTTP.
- Our SSL certificate authority couldn't find a cause on their end.
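Not a fix, but a quick way to confirm which names the certificate served on that port actually carries (standard openssl; the hostname is taken from the post):

openssl s_client -connect api.ishippo.com:8282 -servername api.ishippo.com </dev/null 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'

If that prints [*.ishippo.com, ishippo.com], the certificate itself is fine and the exception is about the hostname used to connect: the bin/solr script was checking https://localhost:8282/solr, and "localhost" matches neither SAN. Making the health check and any internal clients connect via api.ishippo.com rather than localhost (the SOLR_HOST variable in solr.in.sh is one place this is commonly set) would be one thing to verify.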