Re: Searching for a term which isn't a part of an expression
Hi, The list of phrases will be relatively dynamic, so changing the indexing process isn't a very good solution for us. We also considered using a PostFilter or adding a SearchComponent to filter out the "bad" results, but obviously true query-time support would be a lot better. On Wed, Dec 14, 2016 at 10:52 PM, Ahmet Arslan wrote: > Hi, > > Do you have a common list of phrases that you want to prohibit partial > match? > You can index those phrases in a special way, for example, > > This is a new world hello_world hot_dog tap_water etc. > > ahmet > > > On Wednesday, December 14, 2016 9:20 PM, deansg wrote: > We would like to enable queries for a specific term that doesn't appear as > a > part of a given expression. Negating the expression will not help, as we > still want to return items that contain the term independently, even if > they > contain full expression as well. > For example, we would like to search for items that have the term "world" > but not as a part of "hello world". If the text is: "This is a new world. > Hello world", we would still want to return the item, as "world" appears > independently as well as a part of "Hello world". However, we will not want > to return items that only have the expression "hello world" in them. > Does Solr support these types of queries? We thought about using regex, but > since the text is tokenized I don't think that will be possible. > > > > -- > View this message in context: http://lucene.472066.n3. > nabble.com/Searching-for-a-term-which-isn-t-a-part-of-an- > expression-tp4309746.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: Solr Cloud Replica Cores Give different Results for the Same query
Let's back up a bit. You say "This seems to cause two replicas to return different hits depending upon which one is queried." OK, _how_ are they different? I've been assuming different numbers of hits. If you're getting the same number of hits but different document ordering, that's a completely different issue and may be easily explainable. If this is true, skip the rest of this message. I only realized we may be using a different definition of "different hits" part way through writing this reply. Having the timestamp as a string isn't a problem; you can do something very similar with wildcards and the like if it's a string that sorts the same way the timestamp would. And it's best if it's created upstream anyway; that way it's guaranteed to be the same for the doc on all replicas. If the date is in canonical form (YYYY-MM-DDTHH:MM:SSZ) then a simple copyField to a date field would do the trick. But there's no real reason to do any of that. Given that you see this when there's no indexing going on, there's no point to those tests; those were just a way to examine your nodes while there was active indexing. How do you fix this problem when you see it? If it goes away by itself that would give at least a start on where to look. If you have to manually intervene it would be good to know what you do. The CDCR pattern is: docs flow from the leader on the source cluster to the leader on the target cluster. Once the target leader gets the docs, it's supposed to send the docs to all the replicas. To try to narrow down the issue, next time it occurs can you look at _both_ the source and target clusters and see if they _both_ show the same discrepancy? What I'm looking for is whether both are self-consistent. That is, all the replicas for shardN on the source cluster show the same number of documents (M), and all the replicas for shardN on the target cluster show the same number of docs (N). I'm not as concerned if M != N at this point. 
Note I'm looking at the number of hits here, not say the document ordering. To do this you'll have to do the trick I mentioned where you query each replica separately. And are you absolutely sure that your different results are coming from the _same_ cluster? If you're comparing a query from the source cluster with a query from the target cluster, that's different than if the queries come from the same cluster. Best, Erick On Wed, Dec 14, 2016 at 2:48 PM, Webster Homer wrote: > Thanks for the quick feedback. > > We are not doing continuous indexing, we do a complete load once a week and > then have a daily partial load for any documents that have changed since > the load. These partial loads take only a few minutes every morning. > > The problem is we see this discrepancy long after the data load completes. > > We have a source collection that uses cdcr to replicate to the target. I > see the current=false setting in both the source and target collections. > Only the target collection is being heavily searched so that is where my > concern is. So what could cause this kind of issue? > Do we have a configuration problem? > > It doesn't happen all the time, so I don't currently have a reproducible > test case, yet. > > I will see about adding the timestamp, we have one, but it was created as a > string, and was generated by our ETL job > > On Wed, Dec 14, 2016 at 3:42 PM, Erick Erickson > wrote: > >> The commit points on different replicas will trip at different wall >> clock times so the leader and replica may return slightly different >> results depending on whether doc X was included in the commit on one >> replica but not on the second. After the _next_ commit interval (2 >> seconds in your case), doc X will be committed on the second replica: >> that is it's not lost. >> >> Here's a couple of ways to verify: >> >> 1> turn off indexing and wait a few seconds. The replicas should have >> the exact same documents. 
"A few seconds" is your autocommit (soft in >> your case) interval + autowarm time. This last is unknown, but you can >> check your admin/plugins-stats search handler times, it's reported >> there. Now issue your queries. If the replicas don't report the same >> docs A Bad Thing that should be worrying. BTW, with a 2 second soft >> commit interval, which is really aggressive, you _better not_ have >> very large autowarm intervals! >> >> 2> Include a timestamp in your docs when they are indexed. There's an >> automatic way to do that BTW now do your queries and append an FQ >> clause like &fq=timestamp:[* TO some_point_in_the_past]. The replicas >> should have the same counts unless you are deleting documents. I >> mention deletes on the off chance that you're deleting documents that >> fall in the interval and then the same as above could theoretically >> occur. Updates should be fine. >> >> BTW, I've seen continuous monitoring of this done by automated >> scripts. The key is to get the shard URL and ping that with >> &distrib=false. It'll
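The per-replica check Erick describes (hit each core directly with &distrib=false and compare numFound) can be sketched as below. The URLs in the comments and the sample response bodies are made-up examples, and the JSON handling is deliberately simple; only the comparison logic is shown:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplicaCheck {
    private static final Pattern NUM_FOUND = Pattern.compile("\"numFound\"\\s*:\\s*(\\d+)");

    // Parse numFound out of a wt=json response body. In a real check you would
    // fetch each replica's core URL directly, e.g.
    //   http://host:port/solr/collection_shard1_replica1/select?q=*:*&distrib=false&rows=0&wt=json
    // (distrib=false keeps the query on that one core).
    static long numFound(String jsonBody) {
        Matcher m = NUM_FOUND.matcher(jsonBody);
        if (!m.find()) throw new IllegalArgumentException("no numFound in response");
        return Long.parseLong(m.group(1));
    }

    public static void main(String[] args) {
        // Pretend these came from two replicas of the same shard.
        String replica1 = "{\"response\":{\"numFound\":97895,\"start\":0,\"docs\":[]}}";
        String replica2 = "{\"response\":{\"numFound\":97893,\"start\":0,\"docs\":[]}}";
        long a = numFound(replica1), b = numFound(replica2);
        System.out.println(a == b ? "replicas agree" : "MISMATCH: " + a + " vs " + b);
        // prints: MISMATCH: 97895 vs 97893
    }
}
```

Run against each replica of the same shard on the same cluster; all replicas of a shard should report the same numFound once commits have settled.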
Re: Nested JSON Facets (Subfacets)
That should work... what version of Solr are you using? Did you change the type of the popularity field w/o completely reindexing? You can try to verify the number of documents in each bucket that have the popularity field by adding another sub-facet next to cat_pop: num_pop:{query:"popularity:[* TO *]"} > A quick check with this json.facet parameter: > > json.facet: {cat_pop:"sum(popularity)"} > > returns: > > "facets": { > "count":2508, > "cat_pop":21.0}, That looks like a pretty low sum for all those documents; perhaps most of them are missing "popularity" (or have a 0 popularity). To test one of the buckets at the top-level this way, you could add fq=shop_cat:"Men > Clothing > Jumpers & Cardigans" and see if you get anything. -Yonik On Wed, Dec 14, 2016 at 12:46 PM, CA wrote: > Hi all, > > this is about using a function in nested facets, specifically the "sum()" > function inside a "terms" facet using the json.facet api. > > My json.facet parameter looks like this: > > json.facet={shop_cat: {type:terms, field:shop_cat, facet: > {cat_pop:"sum(popularity)"}}} > > A snippet of the result: > > "facets": { > "count":2508, > "shop_cat": { > "buckets": [{ > "val": "Men > Clothing > Jumpers & Cardigans", > "count":252, > "cat_pop":0.0 > }, { >"val":"Men > Clothing > Jackets & Coats", >"count":157, >"cat_pop":0.0 > }, // and more > > This looks fine all over but it turns out that "cat_pop", the result of > "sum(popularity)" is always 0.0 even if the documents for this facet value > have popularities > 0. > > A quick check with this json.facet parameter: > > json.facet: {cat_pop:"sum(popularity)"} > > returns: > > "facets": { > "count":2508, > "cat_pop":21.0}, > > To me, it seems it works fine on the base level but not when nested. Still, > Yonik’s documentation and the Jira issues indicate that it is possible to use > functions in nested facets so I might just be using the wrong structure? 
I > have a hard time finding any other examples on the i-net and I had no luck > changing the structure around. > Could someone shed some light on this for me? It would also help to know if > it is not possible to sum the values up this way. > > Thanks a lot! > Chantal > >
Re: Solr Cloud Replica Cores Give different Results for the Same query
Thanks for the quick feedback. We are not doing continuous indexing; we do a complete load once a week and then have a daily partial load for any documents that have changed since the load. These partial loads take only a few minutes every morning. The problem is we see this discrepancy long after the data load completes. We have a source collection that uses cdcr to replicate to the target. I see the current=false setting in both the source and target collections. Only the target collection is being heavily searched so that is where my concern is. So what could cause this kind of issue? Do we have a configuration problem? It doesn't happen all the time, so I don't currently have a reproducible test case, yet. I will see about adding the timestamp; we have one, but it was created as a string and was generated by our ETL job. On Wed, Dec 14, 2016 at 3:42 PM, Erick Erickson wrote: > The commit points on different replicas will trip at different wall > clock times so the leader and replica may return slightly different > results depending on whether doc X was included in the commit on one > replica but not on the second. After the _next_ commit interval (2 > seconds in your case), doc X will be committed on the second replica: > that is it's not lost. > > Here's a couple of ways to verify: > > 1> turn off indexing and wait a few seconds. The replicas should have > the exact same documents. 
There's an > automatic way to do that BTW now do your queries and append an FQ > clause like &fq=timestamp:[* TO some_point_in_the_past]. The replicas > should have the same counts unless you are deleting documents. I > mention deletes on the off chance that you're deleting documents that > fall in the interval and then the same as above could theoretically > occur. Updates should be fine. > > BTW, I've seen continuous monitoring of this done by automated > scripts. The key is to get the shard URL and ping that with > &distrib=false. It'll look something like > http://host:port/solr/collection_shard1_replica1 People usually > just use *:* and compare numFound. > > Best, > Erick > > > > On Wed, Dec 14, 2016 at 1:10 PM, Webster Homer > wrote: > > We are using Solr Cloud 6.2 > > > > We have been noticing an issue where the index in a core shows as > current = > > false > > > > We have autocommit set for 15 seconds, and soft commit at 2 seconds > > > > This seems to cause two replicas to return different hits depending upon > > which one is queried. > > > > What would lead to the indexes not being "current"? The documentation on > > the meaning of current is vague. > > > > The collections in our cloud have two shards each with two replicas. I > see > > this with several of the collections. > > > > We don't know how they get like this but it's troubling > > > > -- > > > > > > This message and any attachment are confidential and may be privileged or > > otherwise protected from disclosure. If you are not the intended > recipient, > > you must not copy this message or attachment or disclose the contents to > > any other person. If you have received this transmission in error, please > > notify the sender immediately and delete the message and any attachment > > from your system. 
Merck KGaA, Darmstadt, Germany and any of its > > subsidiaries do not accept liability for any omissions or errors in this > > message which may arise as a result of E-Mail-transmission or for damages > > resulting from any unauthorized changes of the content of this message > and > > any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its > > subsidiaries do not guarantee that this message is free of viruses and > does > > not accept liability for any damages caused by any virus transmitted > > therewith. > > > > Click http://www.merckgroup.com/disclaimer to access the German, French, > > Spanish and Portuguese versions of this disclaimer. >
RE: DocTransformer not always working
Hello - i just looked up the DocTransformer Javadoc and spotted the getExtraRequestFields method. What you mention makes sense, so i immediately tried: solr/search/select?omitHeader=true&wt=json&indent=true&rows=1&sort=id asc&q=*:*&fl=minhash,minhash:[binstr] { "response":{"numFound":97895,"start":0,"docs":[ { "minhash":"11101101001010001101001010111101100100110010"}] }} So as i get it, instead of using getRequestedFields, i just did an explicit get for those fields. Don't mind the changed numFound, it's a live index. Well, i can work with this really fine knowing this, but does it make sense? I did assume (or was wrong in doing so) that fl=minhash:[binstr] should mean get that field and pass it through the transformer. At least i just now fell for it; maybe others shouldn't :) Anyway, thanks again today, Markus -Original message- > From:Chris Hostetter > Sent: Wednesday 14th December 2016 23:14 > To: solr-user > Subject: Re: DocTransformer not always working > > > Fairly certain you aren't overridding getExtraRequestFields, so when your > DocTransformer is evaluated it can'd find the field you want it to > transform. > > By default, the ResponseWriters don't provide any fields that aren't > explicitly requested by the user, or specified as "extra" by the > DocTransformer. > > IIUC you want the stored value of the "minhash" field to be available to > you, but the response writer code doesn't know that -- it just knows you > want "minhash" to be the output respons key for the "[binstr]" > transformer. > > > Take a look at RawValueTransformerFactory as an example to borrow from. > > > > > : Date: Wed, 14 Dec 2016 21:55:26 + > : From: Markus Jelsma > : Reply-To: solr-user@lucene.apache.org > : To: solr-user > : Subject: DocTransformer not always working > : > : Hello - I just spotted an oddity with all two custom DocTransformers we > sometimes use on Solr 6.3.0. This particular transformer in the example just > transforms a long (or int) into a sequence of bits. 
I just use it as an > convenience to compare minhashes with my eyeballs. First example is very > straightforward, fl=minhash:[binstr], show only the minhash field, but as a > bit sequence. > : > : > solr/search/select?omitHeader=true&wt=json&indent=true&rows=1&sort=id%20asc&q=*:*&fl=minhash:[binstr] > : { > : "response":{"numFound":96933,"start":0,"docs":[ > : {}] > : }} > : > : The document is empty! This also happens with another transformer. The next > example i also request the lang field: > : > : solr/search/select?omitHeader=true&wt=json&indent=true&rows=1&sort=id > asc&q=*:*&fl=lang,minhash:[binstr] > : { > : "response":{"numFound":96933,"start":0,"docs":[ > : { > : "lang":"nl"}] > : }} > : > : Ok, at least i now get the lang field, but the transformed minhash is > nowhere to be seen. In the next example i request all fields and the > transformed minhash: > : > : > /solr/search/select?omitHeader=true&wt=json&indent=true&rows=1&sort=id%20asc&q=*:*&fl=*,minhash:[binstr] > : { > : "response":{"numFound":96933,"start":0,"docs":[ > : { > : > "minhash":"11101101001010001101001010111101100100110010", > : ...other fields here > : "_version_":1553728923368423424}] > : }} > : > : So it seems that right now, i can only use a transformer properly if i > request all fields. I believe it used to work with all three examples just as > you would expect. But since i haven't used transformers for a while, i don't > know at which version it stopped working like that (if it ever did of course > :) > : > : Did i mess something up or did a bug creep on me? > : > : Thanks, > : Markus > : > > -Hoss > http://www.lucidworks.com/ >
Re: DocTransformer not always working
Fairly certain you aren't overriding getExtraRequestFields, so when your DocTransformer is evaluated it can't find the field you want it to transform. By default, the ResponseWriters don't provide any fields that aren't explicitly requested by the user, or specified as "extra" by the DocTransformer. IIUC you want the stored value of the "minhash" field to be available to you, but the response writer code doesn't know that -- it just knows you want "minhash" to be the output response key for the "[binstr]" transformer. Take a look at RawValueTransformerFactory as an example to borrow from. : Date: Wed, 14 Dec 2016 21:55:26 + : From: Markus Jelsma : Reply-To: solr-user@lucene.apache.org : To: solr-user : Subject: DocTransformer not always working : : Hello - I just spotted an oddity with all two custom DocTransformers we sometimes use on Solr 6.3.0. This particular transformer in the example just transforms a long (or int) into a sequence of bits. I just use it as an convenience to compare minhashes with my eyeballs. First example is very straightforward, fl=minhash:[binstr], show only the minhash field, but as a bit sequence. : : solr/search/select?omitHeader=true&wt=json&indent=true&rows=1&sort=id%20asc&q=*:*&fl=minhash:[binstr] : { : "response":{"numFound":96933,"start":0,"docs":[ : {}] : }} : : The document is empty! This also happens with another transformer. The next example i also request the lang field: : : solr/search/select?omitHeader=true&wt=json&indent=true&rows=1&sort=id asc&q=*:*&fl=lang,minhash:[binstr] : { : "response":{"numFound":96933,"start":0,"docs":[ : { : "lang":"nl"}] : }} : : Ok, at least i now get the lang field, but the transformed minhash is nowhere to be seen. 
In the next example i request all fields and the transformed minhash: : : /solr/search/select?omitHeader=true&wt=json&indent=true&rows=1&sort=id%20asc&q=*:*&fl=*,minhash:[binstr] : { : "response":{"numFound":96933,"start":0,"docs":[ : { : "minhash":"11101101001010001101001010111101100100110010", : ...other fields here : "_version_":1553728923368423424}] : }} : : So it seems that right now, i can only use a transformer properly if i request all fields. I believe it used to work with all three examples just as you would expect. But since i haven't used transformers for a while, i don't know at which version it stopped working like that (if it ever did of course :) : : Did i mess something up or did a bug creep on me? : : Thanks, : Markus : -Hoss http://www.lucidworks.com/
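The fix Hoss is pointing at is an override along these lines inside the transformer. This is only a sketch: getExtraRequestFields is the DocTransformer method named above, but the surrounding class and the `sourceField` member are assumptions about the custom code:

```
// Inside the custom [binstr] DocTransformer (Solr 6.x), assuming the
// stored source field name is held in `sourceField` (e.g. "minhash"):
@Override
public String[] getExtraRequestFields() {
    // Ask the response writer to also fetch this stored field, even when
    // the user's fl only names the transformer's output key.
    return new String[] { sourceField };
}
```

With that in place, fl=minhash:[binstr] alone should be enough; the explicit fl=minhash,minhash:[binstr] workaround becomes unnecessary.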
Re: High increasing slab memory solr 6
On 12/14/2016 7:12 AM, moscovig wrote: > Shawn, thanks for the reply > > Please take a look at that post. It's describing the same issue with ES > > They describe the issue as "dentry cache is bloating memory" > > https://discuss.elastic.co/t/memory-usage-of-the-machine-with-es-is-continuously-increasing/23537/5 They concluded that it was not a problem in ES or Lucene. It's an OS issue, and is mostly only an annoyance, because the memory is reclaimable. If the amount of memory involved is very large, apparently it can cause long stop-the-world pauses as the memory is automatically cleaned up by the OS. There is absolutely nothing that Solr or Lucene (or even ES) can do about this issue. It is perfectly normal for programs to check for the existence of files that do not actually exist at the moment of the check. The issues that can be reached from that post say that attempting to stat nonexistent files is the trigger for the problem in the OS. Updating your OS to the newest update packages (and probably rebooting) might fix it. Thanks, Shawn
DocTransformer not always working
Hello - I just spotted an oddity with the two custom DocTransformers we sometimes use on Solr 6.3.0. This particular transformer in the example just transforms a long (or int) into a sequence of bits. I just use it as a convenience to compare minhashes with my eyeballs. First example is very straightforward, fl=minhash:[binstr], show only the minhash field, but as a bit sequence. solr/search/select?omitHeader=true&wt=json&indent=true&rows=1&sort=id%20asc&q=*:*&fl=minhash:[binstr] { "response":{"numFound":96933,"start":0,"docs":[ {}] }} The document is empty! This also happens with another transformer. In the next example i also request the lang field: solr/search/select?omitHeader=true&wt=json&indent=true&rows=1&sort=id asc&q=*:*&fl=lang,minhash:[binstr] { "response":{"numFound":96933,"start":0,"docs":[ { "lang":"nl"}] }} Ok, at least i now get the lang field, but the transformed minhash is nowhere to be seen. In the next example i request all fields and the transformed minhash: /solr/search/select?omitHeader=true&wt=json&indent=true&rows=1&sort=id%20asc&q=*:*&fl=*,minhash:[binstr] { "response":{"numFound":96933,"start":0,"docs":[ { "minhash":"11101101001010001101001010111101100100110010", ...other fields here "_version_":1553728923368423424}] }} So it seems that right now, i can only use a transformer properly if i request all fields. I believe it used to work with all three examples just as you would expect. But since i haven't used transformers for a while, i don't know at which version it stopped working like that (if it ever did of course :) Did i mess something up or did a bug creep on me? Thanks, Markus
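The transformer itself is custom code not shown in the thread, but its core conversion is presumably something like this standalone helper (a guess at the behavior described above, padding to a fixed width so two minhashes line up for eyeball comparison):

```java
class BinStr {
    // Render a long as a zero-padded bit string of the given width.
    static String toBits(long value, int width) {
        String bits = Long.toBinaryString(value);
        return "0".repeat(Math.max(0, width - bits.length())) + bits;
    }

    public static void main(String[] args) {
        System.out.println(toBits(44L, 8));  // prints 00101100
    }
}
```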
Re: Solr Cloud Replica Cores Give different Results for the Same query
The commit points on different replicas will trip at different wall clock times so the leader and replica may return slightly different results depending on whether doc X was included in the commit on one replica but not on the second. After the _next_ commit interval (2 seconds in your case), doc X will be committed on the second replica: that is, it's not lost. Here are a couple of ways to verify: 1> turn off indexing and wait a few seconds. The replicas should have the exact same documents. "A few seconds" is your autocommit (soft in your case) interval + autowarm time. This last is unknown, but you can check your admin/plugins-stats search handler times, it's reported there. Now issue your queries. If the replicas don't report the same docs, that's A Bad Thing and should be worrying. BTW, with a 2 second soft commit interval, which is really aggressive, you _better not_ have very large autowarm intervals! 2> Include a timestamp in your docs when they are indexed. There's an automatic way to do that BTW. Now do your queries and append an FQ clause like &fq=timestamp:[* TO some_point_in_the_past]. The replicas should have the same counts unless you are deleting documents. I mention deletes on the off chance that you're deleting documents that fall in the interval, in which case the same as above could theoretically occur. Updates should be fine. BTW, I've seen continuous monitoring of this done by automated scripts. The key is to get the shard URL and ping that with &distrib=false. It'll look something like http://host:port/solr/collection_shard1_replica1 People usually just use *:* and compare numFound. Best, Erick On Wed, Dec 14, 2016 at 1:10 PM, Webster Homer wrote: > We are using Solr Cloud 6.2 > > We have been noticing an issue where the index in a core shows as current = > false > > We have autocommit set for 15 seconds, and soft commit at 2 seconds > > This seems to cause two replicas to return different hits depending upon > which one is queried. 
> > What would lead to the indexes not being "current"? The documentation on > the meaning of current is vague. > > The collections in our cloud have two shards each with two replicas. I see > this with several of the collections. > > We don't know how they get like this but it's troubling > > --
Solr Cloud Replica Cores Give different Results for the Same query
We are using Solr Cloud 6.2 We have been noticing an issue where the index in a core shows as current = false We have autocommit set for 15 seconds, and soft commit at 2 seconds This seems to cause two replicas to return different hits depending upon which one is queried. What would lead to the indexes not being "current"? The documentation on the meaning of current is vague. The collections in our cloud have two shards each with two replicas. I see this with several of the collections. We don't know how they get like this but it's troubling
RE: Traverse over response docs in SearchComponent impl.
Thanks! Running the same code in cloud mode worked nicely almost right away. Getting it to work in non-cloud mode is still non-trivial. I can get the DocList in process(), but AFAIK it just provides Lucene docIds, not a nice SolrDocumentList we could work with. The use-case is straightforward: the resultset contains ids. I collect them and do a bulk getById to another Solr index. Fields retrieved from the remote index (specified via fl) are added to the resultset, enriching each document inside the server, without intervening middleware. All our servers run in cloud mode, so getting it to work in local mode is just a convenience when developing. We have quite a few components that run in cloud and non-cloud mode. Non-cloud mode is for some reason almost always harder to implement, sometimes even at Lucene level with IndexSearcher, hand-crafted queries and all. Thanks again, it works like a charm. Markus -Original message- > From:Chris Hostetter > Sent: Tuesday 13th December 2016 23:27 > To: solr-user > Subject: Re: Traverse over response docs in SearchComponent impl. > > > FWIW: Perhaps an XY problem? can you explain more in depth what it is you > plan on doing in this search component? > > : I can see that Solr calls the component's process() method, but from > : within that method, rb.getResponseDocs(); is always null. No matter what > : i try, i do not seem to be able to get a hold of that list of response > : docs. > > IIRC getResponseDocs() is only non-null when aggregating distributed/cloud > results from multiple shards (where we already have a fully > populated SolrDocumentList due to aggregating the remote responses), but in > a single-node Solr request only a "DocList" is used, and the stored field > values are read lazily from the IndexReader by the ResponseWriter. > > So if you're not writing a distributed component, check > ResponseBuilder.getResults() ? 
> > Even if you are writing a component for a distributed solr setup, what > method you call (and where you call it) depends a lot on when/where you > expect your code to run... > > IIRC: > * prepare() runs on every node for every request (original aggregation > request and every sub-request to each shard). > * distributedProcess runs on the aggregation node, and is called > repeatedly for each "stage" requested by any components (so at a minimum > once, > usually twice to fetch stored fields, maybe more if there are multiple > facet refinement phases, etc...). > * modifyRequest() & handleResponses() are called on the aggregation node > prior/after every sub-request to every shard. > * process() is called on each shard for each sub request. > * finishStage is called on the aggregation node at the end of each stage > (after all the responses from all shards for that sub-request) > > > ...so something like HighlightComponent does its main work in the > process() method, because it only needs the data for each doc, the impacts > of other (aggregated) docs don't affect the results -- then later > finishStage combines the results. > > If you on the other hand want to look at all of the *final* documents being > returned to the user, not on a per-shard basis but on an aggregate basis, > you'd want to put that logic in something like finishStage and check for > the stage that does a GET_FIELDS -- but if you want your component to > *also* work in non-cloud mode, you'd need the same logic in your process() > method (looking at the DocList instead of the SolrDocumentList, with a > conditional to check for distrib=false so you don't waste a bunch of work > on per-shard queries when it is in fact being used in cloud-mode) > > > None of this is very straightforward, but you are admittedly getting into > very advanced expert territory here. > > > > -Hoss > http://www.lucidworks.com/ >
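Hoss's advice boils down to a skeleton like the following. This is a non-runnable sketch: the class name is an assumption, the isShard guard is the common idiom for the distrib conditional he mentions, and the enrichment logic itself is elided:

```
public class EnrichComponent extends SearchComponent {

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // Skip per-shard sub-requests; in cloud mode the work happens in finishStage.
    if (rb.req.getParams().getBool(ShardParams.IS_SHARD, false)) return;
    // Non-cloud path: only a DocList of internal Lucene docids is available.
    DocList docs = rb.getResults().docList;
    // ... resolve ids via rb.req.getSearcher().doc(docid), then enrich ...
  }

  @Override
  public void finishStage(ResponseBuilder rb) {
    // Cloud path: the aggregated SolrDocumentList is ready after GET_FIELDS.
    if (rb.stage == ResponseBuilder.STAGE_GET_FIELDS) {
      SolrDocumentList docs = rb.getResponseDocs();
      // ... collect ids, bulk getById against the remote index, add fields ...
    }
  }
}
```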
Re: Searching for a term which isn't a part of an expression
Hi, Do you have a common list of phrases for which you want to prohibit partial matches? You can index those phrases in a special way, for example, This is a new world hello_world hot_dog tap_water etc. ahmet On Wednesday, December 14, 2016 9:20 PM, deansg wrote: We would like to enable queries for a specific term that doesn't appear as a part of a given expression. Negating the expression will not help, as we still want to return items that contain the term independently, even if they contain the full expression as well. For example, we would like to search for items that have the term "world" but not as a part of "hello world". If the text is: "This is a new world. Hello world", we would still want to return the item, as "world" appears independently as well as a part of "Hello world". However, we will not want to return items that only have the expression "hello world" in them. Does Solr support these types of queries? We thought about using regex, but since the text is tokenized I don't think that will be possible. -- View this message in context: http://lucene.472066.n3.nabble.com/Searching-for-a-term-which-isn-t-a-part-of-an-expression-tp4309746.html Sent from the Solr - User mailing list archive at Nabble.com.
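A minimal sketch of that indexing trick, done as a pre-indexing step outside Solr (the class name and the fixed phrase list are assumptions; a char filter or token filter inside the analysis chain would be the more idiomatic place for this):

```java
import java.util.List;

class PhraseJoiner {
    // Known phrases whose occurrences should also be indexed as one token.
    private final List<String> phrases;

    PhraseJoiner(List<String> phrases) {
        this.phrases = phrases;
    }

    // Appends an underscore-joined token for each known phrase found in the
    // text, so "world" still matches "a new world" independently, while
    // "hello world" occurrences become targetable as the token "hello_world".
    String expand(String text) {
        StringBuilder sb = new StringBuilder(text);
        String lower = text.toLowerCase();
        for (String phrase : phrases) {
            if (lower.contains(phrase.toLowerCase())) {
                sb.append(' ').append(phrase.toLowerCase().replace(' ', '_'));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PhraseJoiner joiner = new PhraseJoiner(List.of("hello world", "hot dog", "tap water"));
        System.out.println(joiner.expand("This is a new world. Hello world"));
        // prints: This is a new world. Hello world hello_world
    }
}
```

With documents indexed this way, a query like world AND NOT hello_world would match only docs where "world" occurs outside the phrase.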
Solr on HDFS: increase in query time with increase in data
Hi everyone, I am running Solr 5.5.0 on HDFS. It is a SolrCloud of 50 nodes and I have the following config. maxShardsPerNode: 1 replicationFactor: 1 I have been ingesting data into Solr for the last 3 months. With the increase in data, I am observing an increase in the query time. Currently the size of my indices is 70 GB per shard (i.e. per node). I am using the cursor approach (/export handler) with the SolrJ client to get back results from Solr. All the fields I am querying on and all the fields that I get back from Solr are indexed and have docValues enabled as well. What could be the reason behind the increase in query time? Has this got something to do with the OS disk cache that is used for loading the Solr indices? When a query is fired, will Solr wait for all (70GB) of disk cache to be available so that it can load the index file? Thanks!
Searching for a term which isn't a part of an expression
We would like to enable queries for a specific term that doesn't appear as a part of a given expression. Negating the expression will not help, as we still want to return items that contain the term independently, even if they contain the full expression as well. For example, we would like to search for items that have the term "world" but not as a part of "hello world". If the text is: "This is a new world. Hello world", we would still want to return the item, as "world" appears independently as well as a part of "Hello world". However, we will not want to return items that only have the expression "hello world" in them. Does Solr support these types of queries? We thought about using regex, but since the text is tokenized I don't think that will be possible. -- View this message in context: http://lucene.472066.n3.nabble.com/Searching-for-a-term-which-isn-t-a-part-of-an-expression-tp4309746.html Sent from the Solr - User mailing list archive at Nabble.com.
Nested JSON Facets (Subfacets)
Hi all, this is about using a function in nested facets, specifically the "sum()" function inside a "terms" facet using the json.facet API. My json.facet parameter looks like this: json.facet={shop_cat: {type:terms, field:shop_cat, facet: {cat_pop:"sum(popularity)"}}} A snippet of the result: "facets": { "count":2508, "shop_cat": { "buckets": [{ "val": "Men > Clothing > Jumpers & Cardigans", "count":252, "cat_pop":0.0 }, { "val":"Men > Clothing > Jackets & Coats", "count":157, "cat_pop":0.0 }, // and more This looks fine all over, but it turns out that "cat_pop", the result of "sum(popularity)", is always 0.0 even if the documents for this facet value have popularities > 0. A quick check with this json.facet parameter: json.facet: {cat_pop:"sum(popularity)"} returns: "facets": { "count":2508, "cat_pop":21.0}, To me, it seems it works fine on the base level but not when nested. Still, Yonik's documentation and the Jira issues indicate that it is possible to use functions in nested facets, so I might just be using the wrong structure? I have a hard time finding any other examples on the internet and I had no luck changing the structure around. Could someone shed some light on this for me? It would also help to know if it is not possible to sum the values up this way. Thanks a lot! Chantal
Re: "on deck" searcher vs warming searcher
: In a situation where searchers A-E are queued in the states : A: Current : B: Warming : C: Ondeck : D: Ondeck : E: Being created with newSearcher : : wouldn't it make sense to discard C before it gets promoted to Warming, : as the immediate action after warming C would be to start warming D? : : Are there some situations where the (potentially extremely short lived) : C searcher must be visible before D replaces it? In theory it might make sense to throw out C, but in practice: 1) since maxWarmingSearchers is typically a small value, E (and sometimes D) are rarely created 2) because of how the code is structured, discarding C isn't particularly easy ... the calls are happening in parallel threads, ie: some Thread #1 is warming B while some thread #2 has just opened C and is blocked on the single threaded warming executor while waiting to warm it. When Thread #3 comes along and opens D, it also gets blocked on the same executor. We'd need to revamp that code in some way that the existence of Thread #3 (and beyond) while Thread #2 is queued up would cause Thread #2 to close C (w/o warming it) and instead be blocked waiting for D to warm -- such that once D completes warming, both Thread #2 and Thread #3 return D. All of which is complicated by the fact that the code is actually returning the Searchers immediately, but also returning/setting a Future ref that is what's waiting on the warming to finish -- so callers can actually use the searchers concurrently with the warming (ie: useColdSearcher) if they wish. So in a nutshell: yes, but it would be a pretty invasive change, and AFAIK rarely impacts people who don't already have bigger problems. -Hoss http://www.lucidworks.com/
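For context, the knobs this thread touches live in the query section of solrconfig.xml; a sketch (the values are illustrative, not recommendations):

```xml
<query>
  <!-- cap how many searchers may be warming/on deck at once; exceeding
       this is what triggers the "maxWarmingSearchers" log warning -->
  <maxWarmingSearchers>2</maxWarmingSearchers>
  <!-- if true, requests may use a new searcher before its warming finishes -->
  <useColdSearcher>false</useColdSearcher>
</query>
```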
Re: Reg: Is there a way to query solr leader directly using solrj?
First off I'm a bit confused. You say you're working with an UpdateProcessorFactory but then want to use SolrJ to get a leader. Why do this? Why not just work entirely locally and reach into the _local_ index (note, you have to do this after the doc has been routed to the correct shard)? Once there you should be able to use the real-time get functionality to get the latest version that's been sent regardless of whether it's been committed or not. And in the middle of this you say you're "pointing to the leader", which implies you're really doing this from some external SolrJ client, not as part of an update chain at all. So I'm missing something. Or are you talking about doing this on the _client_? To answer your question, though, CloudSolrClient.getCollection(...).getLeader(...)... Best, Erick On Wed, Dec 14, 2016 at 4:48 AM, indhu priya wrote: > Hi, > > In my project I have a one-leader, one-replica architecture. > I am using custom code (using DocumentUpdateProcessorFactory) for merging > old documents with incoming new documents. > > eg. 1. if the 1st document has 10 fields, all 10 fields will be indexed. > 2. if the 2nd document has 8 fields, 5 of which are in the old document and > 3 are new, then we will find the old document in the index (using > SolrJ), update the 5 fields of the old document and add the 3 new fields, > so we have a total of 13 updated fields in the result > document. > > When I am pointing to the leader and do indexing, I am not facing any issues. > But if I point to the replica, then I am facing issues, since document > distribution from the replica to the leader and again to the replica takes time. > > Eg. If the first document comes to the replica at time t1, then the distribution to > the leader happens at t2 and the leader distributes it to the replica at time > t3. But the second incoming document arrives before t3 and hence the > custom code is not able to find its old document for the merge. 
> > Hence, I need to know whether there is any simple way to query the leader > directly using SolrJ other than finding the leader using ZooKeeper and then > hitting the HTTP URL? > > Notes: We are using Solr 5.5 and I tried using ZooKeeper, but ZooKeeper is > distributing the query. > > Please let me know if you have any queries. > > Thanks, > Indhupriya.S
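For anyone wanting to locate the leader without talking to ZooKeeper themselves, a hedged sketch: fetch /solr/admin/collections?action=CLUSTERSTATUS&wt=json once from any node and pick the replica flagged as leader. The response shape below is an assumption based on Solr 5.x and should be verified against your version:

```python
def find_leader_url(cluster_status, collection, shard):
    """Pick the leader replica's base_url out of a parsed CLUSTERSTATUS
    response. Returns None if no replica is flagged as leader."""
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    for replica in shards[shard]["replicas"].values():
        if replica.get("leader") == "true":
            return replica["base_url"]
    return None

# A trimmed sample of the assumed response shape (illustration only):
sample = {
    "cluster": {"collections": {"abc": {"shards": {"shard1": {"replicas": {
        "core_node1": {"base_url": "http://host1:8983/solr", "leader": "true"},
        "core_node2": {"base_url": "http://host2:8983/solr"},
    }}}}}}
}

print(find_leader_url(sample, "abc", "shard1"))  # http://host1:8983/solr
```

Note that leadership can change between the lookup and the request, so this only narrows the window; it does not remove the race the thread describes.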
Reg: Is there a way to query solr leader directly using solrj?
Hi, In my project I have a one-leader, one-replica architecture. I am using custom code (using DocumentUpdateProcessorFactory) for merging old documents with incoming new documents. eg. 1. if the 1st document has 10 fields, all 10 fields will be indexed. 2. if the 2nd document has 8 fields, 5 of which are in the old document and 3 are new, then we will find the old document in the index (using SolrJ), update the 5 fields of the old document and add the 3 new fields, so we have a total of 13 updated fields in the result document. When I am pointing to the leader and do indexing, I am not facing any issues. But if I point to the replica, then I am facing issues, since document distribution from the replica to the leader and again to the replica takes time. Eg. If the first document comes to the replica at time t1, then the distribution to the leader happens at t2 and the leader distributes it to the replica at time t3. But the second incoming document arrives before t3 and hence the custom code is not able to find its old document for the merge. Hence, I need to know whether there is any simple way to query the leader directly using SolrJ other than finding the leader using ZooKeeper and then hitting the HTTP URL? Notes: We are using Solr 5.5 and I tried using ZooKeeper, but ZooKeeper is distributing the query. Please let me know if you have any queries. Thanks, Indhupriya.S
Re: High increasing slab memory solr 6
In the meantime I am removing all the explicit commits we have in the code. Will update if it gets better -- View this message in context: http://lucene.472066.n3.nabble.com/High-increasing-slab-memory-solr-6-tp4309708p4309718.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Has anyone used linode.com to run Solr | ??Best way to deliver PHP/Apache clients with Solr question
Thanks, I understand accessing Solr directly. I'm doing REST calls to a single machine. If I have a cluster of five servers and, say, three Apache servers, can I round-robin the REST calls to all five in the cluster? I guess I'm going to find out. :-) If so, I might be better off just running Apache on all my Solr instances. On 14 December 2016 at 07:08, Dorian Hoxha wrote: > See replies inline: > > On Wed, Dec 14, 2016 at 11:16 AM, GW wrote: > > > Hello folks, > > > > I'm about to set up a Web service I created with PHP/Apache <--> Solr > Cloud > > > > I'm hoping to index a bazillion documents. > > > ok , how many inserts/second ? > > > > > I'm thinking about using Linode.com because the pricing looks great. Any > > opinions?? > > > Pricing is 'ok'. For bazillion documents, I would skip vps and go straight > dedicated. Check out ovh.com / online.net etc etc > > > > > I envision using an Apache/PHP round robin in front of a solr cloud > > > > My thoughts are that I send my requests to the Solr instances on the > > Zookeeper Ensemble. Am I missing something? > > > You contact with solr directly, don't have to connect to zookeeper for > loadbalancing. > > > > > What can I say.. I'm software oriented and a little hardware challenged. > > > > Thanks in advance, > > > > GW > > >
Re: High increasing slab memory solr 6
Shawn, thanks for the reply. Please take a look at this post; it describes the same issue with ES. They describe the issue as "dentry cache is bloating memory": https://discuss.elastic.co/t/memory-usage-of-the-machine-with-es-is-continuously-increasing/23537/5 Thanks Gilad -- View this message in context: http://lucene.472066.n3.nabble.com/High-increasing-slab-memory-solr-6-tp4309708p4309713.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr has a CPU% spike when indexing a batch of data
On 12/14/2016 1:28 AM, forest_soup wrote: > We are doing indexing on the same HTTP endpoint. But as we have shardnum=1 and > replicafactor=1, each collection only has one core. So there should be no > distributed update/query, as we are using SolrJ's CloudSolrClient, which will > get the target URL of the Solr node when requesting each collection. > > For the questions: > * What is the total physical memory in the machine? > 128GB > > * What is the max heap on each of the two Solr processes? > 32GB for each > > * What is the total index size in each Solr process? > Each Solr node (process) has 16 cores. 130GB for each Solr core. So totally >> 2000G for each Solr node. This means that you have approximately 64GB left for your OS after deducting the heap sizes, which it must use for itself and for OS disk caching. With nearly 2 terabytes of index data on the machine, 64GB is nowhere near enough for good performance. The server will be VERY busy whenever there is query activity, so the CPU spike is what I would expect. For that much index data, I would hope to have somewhere between 512GB and 2 terabytes of memory. Adding machines and/or increasing memory in each machine would make your performance better and reduce CPU load. https://wiki.apache.org/solr/SolrPerformanceProblems > * What is the total tlog size in each Solr process? > 25m for each core. So totally 400m for each Solr node. > > > ${solr.ulog.dir:} > name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536} > 1 > 100 > Compared to the amount of index data, 400MB is tiny, but this will take a long time to process on restart. You might want to consider lowering the amount of data that the update log keeps so restarts are faster. > * What are your commit characteristics like -- both manual and automatic. 
> <autoCommit> > <maxDocs>1</maxDocs> > <maxTime>${solr.autoCommit.maxTime:59000}</maxTime> > <openSearcher>false</openSearcher> > </autoCommit> > <autoSoftCommit> > <maxDocs>5000</maxDocs> > <maxTime>${solr.autoSoftCommit.maxTime:31000}</maxTime> > </autoSoftCommit> > I would personally remove the "maxDocs" portion of these settings and do the automatic commits based purely on time. For the amount of data you're handling, those are very low maxDocs numbers, and could result in very frequent commits when you index. The time values are lower than I would prefer, but are probably OK. The number of collections should be no problem. If there were hundreds or thousands, that might be different. Thanks, Shawn
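Shawn's suggestion (drop the maxDocs triggers and commit purely on time) would leave the update handler settings in solrconfig.xml looking something like this sketch; the times are copied from the thread, not tuned recommendations:

```xml
<autoCommit>
  <!-- hard commit on elapsed time only; no maxDocs trigger -->
  <maxTime>${solr.autoCommit.maxTime:59000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <!-- soft commit controls when new documents become visible to searches -->
  <maxTime>${solr.autoSoftCommit.maxTime:31000}</maxTime>
</autoSoftCommit>
```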
Re: Collection API CREATE creates name like '_shard1_replica1'
On 12/14/2016 1:36 AM, Sandeep Khanzode wrote: > I uploaded (upconfig) config (schema and solrconfig XMLs) to Zookeeper > and then linked (linkconfig) the confname to a collection name. When I > attempt to create a collection using the API like this > .../solr/admin/collections?action=CREATE&name=abc&numShards=1&collection.configName=abc > ... > it creates a collection core named abc_shard1_replica1 and not simply > abc. This is exactly what it is supposed to do. These are the *core* names. Each core is a shard replica. The minimum value for shard count and replica count is 1. When making queries or update requests to Solr, you can still use the "abc" name, and SolrCloud will figure out which cores on which machines need to receive the request. Thanks, Shawn
Re: Solr - Amazon like search
On 12/13/2016 10:55 PM, vasanth vijayaraj wrote: > We are building an e-commerce mobile app. I have implemented Solr search and > autocomplete. > But we like the Amazon search and are trying to implement something like > that. Attached a screenshot > of what has been implemented so far > > The search/suggest should sort list of products based on popularity, document > hits and more. > How do we achieve this? Please help us out here. Your attachment didn't make it to the list. They rarely do. We can't see whatever it is you were trying to include. Sorting on things like popularity and hits requires putting that information into the index so that each document has fields that encode this information, allowing you to use Solr's standard sorting functionality with those fields. You also need a process to update that information when there's a new hit. It's possible, but you have to write this into your indexing system. Solr doesn't include special functionality for this. It would be hard to generalize, and it can all be done without special functionality. Thanks, Shawn
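As a sketch of the indexing-side bookkeeping Shawn describes, Solr's atomic updates can bump a counter field in place with the "inc" operation, avoiding a full reindex of the document (the field name "popularity" is hypothetical, and the field must be suitable for atomic updates in your schema):

```python
import json

def increment_popularity(doc_id, amount=1):
    """Build an atomic-update document that increments a numeric
    'popularity' field by `amount`; a JSON list of these would be
    POSTed to /solr/<collection>/update."""
    return {"id": doc_id, "popularity": {"inc": amount}}

payload = json.dumps([increment_popularity("prod-123")])
print(payload)  # [{"id": "prod-123", "popularity": {"inc": 1}}]
```

With the counters in the index, sorting becomes an ordinary query parameter such as `sort=popularity desc`.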
Re: High increasing slab memory solr 6
On 12/14/2016 5:55 AM, moscovig wrote: > We have solr 6.2.1. > One of the collections is causing lots of updates. > We see the following logs: > > /INFO org.apache.solr.core.SolrDeletionPolicy : > SolrDeletionPolicy.onCommit: commits: num=2 > > commit{dir=/opt/solr-6.2.1/server/solr/collection_shard1_replica2/data/index,segFN=segments_qbmv,generation=1228135} > > commit{dir=/opt/solr-6.2.1/server/solr/collection_shard1_replica2/data/index,segFN=segments_qbmw,generation=1228136}/ Those do not look like any problem at all. The first one says INFO, the others probably do too, but what's here doesn't include the severity. > As a result we are running out of memory in the instances hosting the > collection. > The used memory is increased by 1 percent per day. > > The used memory is not part of the Solr's JVM, but part of the Slab memory > (which I get to know now :) ) Solr does not explicitly allocate memory outside of the JVM. Solr (via Java) uses MMAP for access to index data, which relies on the OS using memory for the disk cache, but this is normal OS behavior, and not anything unusual. The OS can instantly re-allocate any of that memory for use by programs that request it. > when cat over /proc/meminfo we get: > / > Slab: 17906760 kB > SReclaimable: 17841548 kB > / > > and slabtop gives: > 91635138 91635138 6%0.19K 4363578 21 17454312K dentry > > ~17 GB for dentry. > > Is there any way to avoid this "memory leak"? > > echo 2 > /proc/sys/vm/drop_caches ; sync is cleaning this "clean" cache > but - This sounds like the memory is being used for the OS disk cache -- which is completely normal, and exactly how your spare memory SHOULD be used. Solr has no control over this, and it's very likely that Java doesn't either. This is *NOT* a memory leak. Your OS is working exactly how it is supposed to work -- using otherwise unallocated memory to speed up the system. If a program requests any of that memory, the OS will instantly release whatever the program requests. 
https://en.wikipedia.org/wiki/Page_cache https://wiki.apache.org/solr/SolrPerformanceProblems http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html > 2. What's the meaning of the SolrDeletionPolicy logs we get? do we commit > lots of updates? deletes? The deletion policy has to do with commit points. This is Lucene functionality that Solr doesn't really use -- a typical deletion policy will delete all but the most recent commit point. Solr has a lot of logging by default that is not useful to the average person, but can provide information vital to a developer who's familiar with Lucene. Thanks, Shawn
High increasing slab memory solr 6
Hi We have Solr 6.2.1. One of the collections is causing lots of updates. We see the following logs: /INFO org.apache.solr.core.SolrDeletionPolicy : SolrDeletionPolicy.onCommit: commits: num=2 commit{dir=/opt/solr-6.2.1/server/solr/collection_shard1_replica2/data/index,segFN=segments_qbmv,generation=1228135} commit{dir=/opt/solr-6.2.1/server/solr/collection_shard1_replica2/data/index,segFN=segments_qbmw,generation=1228136}/ As a result we are running out of memory in the instances hosting the collection. The used memory is increased by 1 percent per day. The used memory is not part of the Solr's JVM, but part of the Slab memory (which I get to know now :) ) when cat over /proc/meminfo we get: / Slab: 17906760 kB SReclaimable: 17841548 kB / and slabtop gives: 91635138 91635138 6%0.19K 4363578 21 17454312K dentry ~17 GB for dentry. Is there any way to avoid this "memory leak"? echo 2 > /proc/sys/vm/drop_caches ; sync is cleaning this "clean" cache, but - 1. Either from the OS side or from the solr collection side? 2. What's the meaning of the SolrDeletionPolicy logs we get? do we commit lots of updates? deletes? Thanks Gilad -- View this message in context: http://lucene.472066.n3.nabble.com/High-increasing-slab-memory-solr-6-tp4309708.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Has anyone used linode.com to run Solr | ??Best way to deliver PHP/Apache clients with Solr question
See replies inline: On Wed, Dec 14, 2016 at 11:16 AM, GW wrote: > Hello folks, > > I'm about to set up a Web service I created with PHP/Apache <--> Solr Cloud > > I'm hoping to index a bazillion documents. > OK, how many inserts/second? > > I'm thinking about using Linode.com because the pricing looks great. Any > opinions?? > Pricing is 'ok'. For a bazillion documents, I would skip a VPS and go straight to dedicated. Check out ovh.com / online.net etc. > > I envision using an Apache/PHP round robin in front of a solr cloud > > My thoughts are that I send my requests to the Solr instances on the > Zookeeper Ensemble. Am I missing something? > You contact Solr directly; you don't have to connect to ZooKeeper for load balancing. > > What can I say.. I'm software oriented and a little hardware challenged. > > Thanks in advance, > > GW >
Has anyone used linode.com to run Solr | ??Best way to deliver PHP/Apache clients with Solr question
Hello folks, I'm about to set up a Web service I created with PHP/Apache <--> Solr Cloud I'm hoping to index a bazillion documents. I'm thinking about using Linode.com because the pricing looks great. Any opinions?? I envision using an Apache/PHP round robin in front of a solr cloud My thoughts are that I send my requests to the Solr instances on the Zookeeper Ensemble. Am I missing something? What can I say.. I'm software oriented and a little hardware challenged. Thanks in advance, GW
Re: "on deck" searcher vs warming searcher
On Tue, 2016-12-13 at 16:07 -0700, Chris Hostetter wrote: > ** "warming" happens in a single threaded executor -- so if there > are multiple ondeck searchers, only one of them at a time is ever a > "warming" searcher > ** multiple ondeck searchers can be a sign of a potential performance > problem (hence the log warning) [...] In a situation where searchers A-E are queued in the states A: Current B: Warming C: Ondeck D: Ondeck E: Being created with newSearcher wouldn't it make sense to discard C before it gets promoted to Warming, as the immediate action after warming C would be to start warming D? Are there some situations where the (potentially extremely short lived) C searcher must be visible before D replaces it? - Toke Eskildsen, State and University Library, Denmark
Collection API CREATE creates name like '_shard1_replica1'
Hi, I uploaded (upconfig) config (schema and solrconfig XMLs) to Zookeeper and then linked (linkconfig) the confname to a collection name. When I attempt to create a collection using the API like this .../solr/admin/collections?action=CREATE&name=abc&numShards=1&collection.configName=abc ... it creates a collection core named abc_shard1_replica1 and not simply abc. What is missing? SRK
Re: Solr has a CPU% spike when indexing a batch of data
Thanks, Shawn! We are doing indexing on the same HTTP endpoint. But as we have shardnum=1 and replicafactor=1, each collection only has one core. So there should be no distributed update/query, as we are using SolrJ's CloudSolrClient, which will get the target URL of the Solr node when requesting each collection. For the questions: * What is the total physical memory in the machine? 128GB * What is the max heap on each of the two Solr processes? 32GB for each * What is the total index size in each Solr process? Each Solr node (process) has 16 cores. 130GB for each Solr core. So totally >2000G for each Solr node. * What is the total tlog size in each Solr process? 25m for each core. So totally 400m for each Solr node. ${solr.ulog.dir:} ${solr.ulog.numVersionBuckets:65536} 1 100 * What are your commit characteristics like -- both manual and automatic. <autoCommit> <maxDocs>1</maxDocs> <maxTime>${solr.autoCommit.maxTime:59000}</maxTime> <openSearcher>false</openSearcher> </autoCommit> <autoSoftCommit> <maxDocs>5000</maxDocs> <maxTime>${solr.autoSoftCommit.maxTime:31000}</maxTime> </autoSoftCommit> * Do you have WARN or ERROR messages in your logfile? No. * How many collections are in each cloud? 80 collections with only one shard each. And replicafactor=1. * How many servers are in each cloud? 5 Solr nodes. So each Solr node has 16 cores. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-has-a-CPU-spike-when-indexing-a-batch-of-data-tp4309529p4309669.html Sent from the Solr - User mailing list archive at Nabble.com.