Re: Trending functionality in Solr
Folks, thanks for this wealth of information. The general consensus seems to be that one should save the queries in a separate Solr core and timestamp them for further analysis. I will try to implement the same. Siegfried, I looked at your JIRA issue, which is impressive but would be overkill in my situation, so I will implement something simpler for my case. Thanks again, everyone, for the help.

On Mon, Feb 9, 2015 at 3:14 AM, Siegfried Goeschl sgoes...@gmx.at wrote:

Hi folks, I implemented something similar but never got around to contributing it - see https://issues.apache.org/jira/browse/SOLR-4056 The code was initially for SOLR3 but was recently ported to SOLR4:

* capturing the most frequent search terms per core
* supports ad-hoc queries
* CSV export

If you are interested we could team up and make a proper SOLR contribution :-)

Cheers, Siegfried Goeschl

On 08.02.15 05:26, S.L wrote:

Folks, Is there a way to implement trending functionality using Solr - that is, to get results for, say, the most searched terms in the past hour? If the most searched terms are not possible, is it possible to at least get the last 100 search terms? Thanks
Re: [MASSMAIL]Re: Trending functionality in Solr
Thanks for stating this in a simple fashion. On Sun, Feb 8, 2015 at 6:07 PM, Jorge Luis Betancourt González jlbetanco...@uci.cu wrote:

For a project I'm working on, what we do is store the user's query in a separate core that we also use to provide autocomplete functionality. So far, the frontend app is responsible for sending the query to Solr, meaning: 1. execute the query against our search core, and 2. send an update request to store the query in the separate core. We use some deduplication (provided by Solr) to avoid storing the same query several times. We don't do what you're after, but it wouldn't be too hard to tag each query with a timestamp field and provide analytics. Off the top of my head, we could wrap the logic that is currently in the frontend app in a custom SearchComponent that automatically sends the search query to the other core for storage, abstracting all of this from the client app. Keep in mind that the considerations regarding volume of data that Shawn raised remain valid. Hope it helps,

- Original Message -
From: Shawn Heisey apa...@elyograg.org
To: solr-user@lucene.apache.org
Sent: Sunday, February 8, 2015 11:03:33 AM
Subject: [MASSMAIL]Re: Trending functionality in Solr

On 2/7/2015 9:26 PM, S.L wrote: Is there a way to implement trending functionality using Solr, to get results for, say, the most searched terms in the past hour? If the most searched terms are not possible, is it possible to at least get the last 100 search terms?

I'm reasonably sure that the only thing Solr has out of the box that can record queries is the logging feature, which defaults to INFO. That data is not directly available to Solr, and it's not in a good format for easy parsing. Queries are not stored anywhere else by Solr. From what I understand, analysis is a relatively easy part of the equation, but the data must be available first, which is the hard part.

Storing it in RAM is pretty much a non-starter -- there are installations that see thousands of queries every second. This is an area for improvement, but the infrastructure would have to be written from scratch. All work on this project is volunteer. We are highly motivated volunteers, but extensive work like this is difficult to fit into donated time. Many people who use Solr are already recording all queries in some other system (like a database), so it is far easier to implement analysis on that data.

Thanks, Shawn
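The separate-core approach described in this thread (index each user query with a timestamp, then facet over a recent time window) can be sketched roughly as follows. This is an illustrative sketch, not code from the thread: the core layout and the field names (query_s, timestamp_dt) are assumptions, and deduplication would be handled separately by Solr's signature-based dedup that Jorge mentions.

```python
import time

def query_log_doc(query_string):
    """Build a document for a hypothetical query-log core.

    Any string field for the query text plus a date field for the
    timestamp works; query_s / timestamp_dt are illustrative names.
    """
    return {
        "query_s": query_string,
        "timestamp_dt": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }

def trending_params(hours=1, top_n=10):
    """Request parameters that facet on the stored query strings,
    restricted to the last N hours, sorted by count."""
    return {
        "q": "*:*",
        "rows": 0,
        "fq": "timestamp_dt:[NOW-{}HOURS TO NOW]".format(hours),
        "facet": "true",
        "facet.field": "query_s",
        "facet.limit": top_n,
        "facet.mincount": 1,
        "facet.sort": "count",
    }
```

These parameters could be passed to a client such as pysolr or appended to a /select request against the query-log core; the facet counts over the filtered window give the most-searched terms, addressing the original question.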
Trending functionality in Solr
Folks, Is there a way to implement trending functionality using Solr - that is, to get results for, say, the most searched terms in the past hour? If the most searched terms are not possible, is it possible to at least get the last 100 search terms? Thanks
DocExpirationUpdateProcessorFactory not deleting records
I am trying to use DocExpirationUpdateProcessorFactory in Solr 4.10.1. I have included the following in my solrconfig.xml:

<updateRequestProcessorChain default="true">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.TimestampUpdateProcessorFactory">
    <str name="fieldName">timestamp</str>
  </processor>
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <int name="autoDeletePeriodSeconds">30</int>
    <str name="ttlFieldName">ttl</str>
    <str name="expirationFieldName">expire_at</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

And I have included the following in my schema.xml:

<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
<field name="ttl" type="date" indexed="true" stored="true" default="NOW+60SECONDS" multiValued="false"/>
<field name="expire_at" type="date" indexed="true" stored="true" multiValued="false"/>

As you can see, I am setting the time to live to 60 seconds and checking for deletions every 30 seconds. When I insert a document and check after a minute, a couple of minutes, or an hour, it never gets deleted. This is what I see in the indexed document - can you please let me know what the issue might be? Note that the expire_at field is never generated in the Solr document, as can be seen below.

id: 3888a8ac-fbc4-437a-8248-132384753c00,
timestamp: 2015-02-04T04:09:21.29Z,
_version_: 1492147724740460500,
ttl: 2015-02-04T04:10:21.29Z
Re: DocExpirationUpdateProcessorFactory not deleting records
Thanks for giving multiple options, I'll try them both out. But last time I checked, having +60SECONDS as the default value for ttl was giving me an invalid date format exception. I am assuming that would only be the case if I use it with the default mechanism in schema.xml, but not when we use solr.DefaultValueUpdateProcessorFactory?

On Wed, Feb 4, 2015 at 1:56 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
:   <int name="autoDeletePeriodSeconds">30</int>
:   <str name="ttlFieldName">ttl</str>
:   <str name="expirationFieldName">expire_at</str>
: </processor>
...
: And I have included the following in my schema.xml:
:
: <field name="ttl" type="date" indexed="true" stored="true" default="NOW+60SECONDS" multiValued="false"/>

There are a couple of problems here...

: As you can see, I am setting the time to live to 60 seconds and checking
: for deletions every 30 seconds. When I insert a document and check after a
: minute, a couple of minutes, or an hour, it never gets deleted.

First off: you aren't actually setting the ttl to 60 seconds. You are setting the ttl to be a fixed moment in time which is 60 seconds from when the doc is written to the index -- basically you are eliminating the need for having a ttl field/param at all and saying this is *exactly* when the document should expire. If that's what you want to do, just eliminate the ttlFieldName everywhere in your schema.xml and solrconfig.xml, set up expire_at in your schema.xml with default="NOW+60SECONDS", and you'll probably be good to go.

Second...

: what might be the issue here? Please note that the expire_at field is
: never getting generated in the Solr document as can be seen below.

...even if you redefined your ttl field to look like this...

  <field name="ttl" type="string" default="+60SECONDS" />

...the expire_at still wouldn't be populated by the processor, because schema field default values are populated *after* the processors run -- so when the DocExpirationUpdateProcessorFactory sees the documents being added, it has no idea that they all have a default ttl, so it doesn't know that you want it to compute an expire_at for you. Instead of using default= in the schema, you can use the DefaultValueUpdateProcessorFactory to assign it *before* the DocExpirationUpdateProcessorFactory sees the doc...

<processor class="solr.DefaultValueUpdateProcessorFactory">
  <str name="fieldName">ttl</str>
  <str name="value">+60SECONDS</str>
</processor>

-Hoss
http://www.lucidworks.com/
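Putting the pieces of this advice together, a corrected chain might look roughly like the sketch below: the ttl is assigned as a string date-math expression by DefaultValueUpdateProcessorFactory *before* the expiration processor runs, and the schema's ttl field becomes a string with no default. This is a sketch assuming the same field names used earlier in the thread, not a verified configuration.

```xml
<updateRequestProcessorChain default="true">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.TimestampUpdateProcessorFactory">
    <str name="fieldName">timestamp</str>
  </processor>
  <!-- Assign the ttl BEFORE DocExpirationUpdateProcessorFactory sees the doc -->
  <processor class="solr.DefaultValueUpdateProcessorFactory">
    <str name="fieldName">ttl</str>
    <str name="value">+60SECONDS</str>
  </processor>
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <int name="autoDeletePeriodSeconds">30</int>
    <str name="ttlFieldName">ttl</str>
    <str name="expirationFieldName">expire_at</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

In schema.xml, ttl would then be declared as a string field with no default, e.g. <field name="ttl" type="string" indexed="true" stored="true"/>, while expire_at stays a date field for the processor to populate.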
Re: DocExpirationUpdateProcessorFactory not deleting records
Great, this is the first example I have seen so far; I wish we could include it in the Wiki. Thanks again! On Wed, Feb 4, 2015 at 2:04 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: Thanks for giving multiple options, I'll try them both out. But last time
: I checked, having +60SECONDS as the default value for ttl was giving me
: an invalid date format exception, I am assuming that would only be the

That's because ttl should not be a date field -- it should be a *string* (as noted in my examples). Time to live is a date math expression that the processor will evaluate for you -- not a date. If you want to specify an explicit date, just set expire_at directly. I.e., do you want to do the math yourself (set expire_at as a date field), or do you want the processor to do the math (set ttl as a string field)?

: ...even if you redefined your ttl field to look like this...
:
:   <field name="ttl" type="string" default="+60SECONDS" />
:
: ...the expire_at still wouldn't be populated by the processor because
: schema field default values are populated *after* the processors run --
: so when the DocExpirationUpdateProcessorFactory sees the documents being
: added, it has no idea that they all have a default ttl, so it doesn't know
: that you want it to compute an expire_at for you.
:
: instead of using default= in the schema, you can use the
: DefaultValueUpdateProcessorFactory to assign it *before* the
: DocExpirationUpdateProcessorFactory sees the doc...
:
: <processor class="solr.DefaultValueUpdateProcessorFactory">
:   <str name="fieldName">ttl</str>
:   <str name="value">+60SECONDS</str>
: </processor>

-Hoss
http://www.lucidworks.com/
Re: distrib=false
Erick, I have attached a screenshot of the topology. As you can see, I have three nodes, and no two replicas of the same shard reside on the same node; this was done deliberately so as not to affect availability. The query I use for testing is a general get-all query of the form *:*. The behavior I notice is that even when a particular replica of a shard is queried with distrib=false, the request goes to the other replica of the same shard. Thanks.

On Sat, Dec 27, 2014 at 2:10 PM, Erick Erickson erickerick...@gmail.com wrote:

How are you sending the request? AFAIK, setting distrib=false should keep the query from being sent to any other node, although I'm not quite sure what happens when you host multiple replicas of the _same_ shard on the same node. So we need: 1. your topology - how many nodes, and which replicas on each? 2. the actual query you send.

Best, Erick

On Sat, Dec 27, 2014 at 8:14 AM, S.L simpleliving...@gmail.com wrote:

Hi All, I have a question regarding distrib=false on Solr queries. It seems that when the parameter is set to false, distribution is restricted only across shards; meaning if I query a particular node within a shard with a replication factor of more than one, the request could still go to another node within the same shard that is a replica of the node I made the initial request to. Is my understanding correct? If so, how do we make sure the request goes only to the node I intend it for? Thanks.
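A way to make the target of such a test unambiguous is to address the specific replica core by name and append distrib=false on the client side, rather than hitting the collection endpoint. A minimal sketch (the host and core names are hypothetical, and this only shows URL construction, not whether Solr's internal routing honors it):

```python
from urllib.parse import urlencode

def direct_core_url(host, core, user_params):
    """Build a /select URL that targets one named replica core and
    disables distributed search, so no sub-queries are fanned out."""
    params = dict(user_params)          # copy so the caller's dict is untouched
    params["distrib"] = "false"
    return "http://{}/solr/{}/select?{}".format(host, core, urlencode(params))
```

For example, direct_core_url("node1:8983", "dyCollection1_shard2_replica1", {"q": "*:*"}) yields a URL that names one replica core explicitly; comparing numFound across the per-replica responses is a common way to check replica consistency.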
How to implement multi-set in a Solr schema.
Hi All, I have a use case where I need to group documents that share a field called bookName; meaning, if there are multiple documents with the same bookName value and the user's input is searched via a query on bookName, I need to be able to group all the documents with the same bookName together, so that I can display them as a group in the UI. What kind of support does Solr provide for such a scenario, and how should I look at changing my schema.xml, which has bookName as a single-valued text field? Thanks.
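Solr's result grouping (field collapsing) covers this use case directly. A sketch of the request parameters follows; note the assumption that bookName is copied into a non-tokenized string field (bookName_s here is an illustrative name, not from the thread), since grouping behaves best on a single-valued, untokenized field:

```python
def group_by_book_params(user_query, groups=10, docs_per_group=5):
    """Request parameters for Solr result grouping on a book-name field.

    A copyField of the tokenized bookName into a string field
    (bookName_s, assumed here) is a common pattern, so that grouping
    keys are whole titles rather than individual tokens.
    """
    return {
        "q": user_query,
        "group": "true",
        "group.field": "bookName_s",   # one group per distinct book name
        "group.limit": docs_per_group, # docs returned inside each group
        "rows": groups,                # number of groups returned
    }
```

The response then nests documents under each distinct bookName value, which maps naturally onto a grouped UI display.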
distrib=false
Hi All, I have a question regarding distrib=false on Solr queries. It seems that when the parameter is set to false, distribution is restricted only across shards; meaning if I query a particular node within a shard with a replication factor of more than one, the request could still go to another node within the same shard that is a replica of the node I made the initial request to. Is my understanding correct? If so, how do we make sure the request goes only to the node I intend it for? Thanks.
Re: 'Illegal character in query' on Solr cloud 4.10.1
Jack, I am using this query to test from the browser, and the error occurs consistently on 5 of the 6 servers in the cluster. The actual API I use is pysolr, so from the front end the query is sent via pysolr. I see the same issue in both Firefox and Google Chrome; the fact that there is an existing Jira for a similar issue made me think this is a Solr issue, but I am still not clear how I can circumvent it.

On Wed, Dec 24, 2014 at 4:57 PM, Jack Krupansky jack.krupan...@gmail.com wrote:

Is the problem here that the error occurs sometimes, or that it doesn't occur all of the time? I mean, it is clearly a bug in the client if it is sending a raw circumflex rather than a URL-encoded circumflex. Also, some browsers automatically URL-encode characters as needed, but I have heard that some browsers don't always encode all of the characters. Question: you mention the URL, but how are you sending that URL to Solr - via a browser address box, curl, or... what? If using curl, you also have to cope with some characters having a shell meaning and needing to be escaped. Whether it is Tomcat or Solr that gives the error, the main point is that a raw circumflex shouldn't be sent to either.

-- Jack Krupansky

On Wed, Dec 24, 2014 at 4:32 PM, Erick Erickson erickerick...@gmail.com wrote:

OK, then I don't think it's a Solr problem. I think 5 of your Tomcats are configured in such a way that they consider ^ to be an illegal character. There have been recurring problems with servlet containers being configured to allow/disallow various characters, and I think that's what's happening here. But this is totally outside Solr. Solr, when it successfully distributes a query, sends the query on to one replica of each shard, and I was wondering if that process wasn't working correctly somehow, although boosting is so common that it would be a huge shock, since that would have broken almost every Tomcat installation out there.

By sending the query directly to each node, you've bypassed any forwarding by Solr, so it looks like the problem occurs before Solr even sees the request. My guess is that 5 of your servers are somehow configured to expect a different character set than the server that works. I'm afraid I don't know Tomcat well enough to direct you there, but take a look here: https://wiki.apache.org/solr/SolrTomcat

Sorry I can't be more help, Erick

On Wed, Dec 24, 2014 at 1:33 AM, S.L simpleliving...@gmail.com wrote:

Erick, scenario 1 that you listed seems to be the case. When I add distrib=false and query each of the 6 servers, only 1 of them returns (partial) results and the rest give the illegal character error. I have not set up any special logging; I do not see any info in catalina.out, but in a file called localhost_access_log.2014-12-24.txt in the tomcat/logs directory I see the following entry when the invalid character error occurs:

[24/Dec/2014:09:25:54 +] GET /solr/dyCollection1_shard2_replica1/?fl=*,score&q=canon+pixma+printer&sort=score+desc,productNameLength%20asc&wt=json&indent=true&rows=100&defType=edismax&qf=productName&mm=2&pf=productName&ps=1&pf2=productName&pf3=productName&stopwords=true&lowercaseOperators=true&bq=hasThumbnailImage:true^2.0&distrib=false HTTP/1.1 500 7781

I am using Tomcat 7.0.42, SolrCloud 4.10.1, and the Oracle JDK: java version 1.7.0_71, Java(TM) SE Runtime Environment (build 1.7.0_71-b14), Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode). Thanks.

On Tue, Dec 23, 2014 at 11:46 AM, Erick Erickson erickerick...@gmail.com wrote:

Hmmm, so you are pinging the servers directly, right? Here are a couple of things to try:

1. Add distrib=false to the query and try each of the 6 servers. What I'm wondering is whether this happens on the sub-query sent out or on the primary server. Adding distrib=false will execute the query only on the node you send it to, and will NOT send sub-queries out to any other node, so you'll get partial results back. If one server continues to work but the other 5 fail, then your servlet container is probably not set up with the right character sets - although why that would manifest itself on the ^ character mystifies me.

2. Let's assume all 6 servers handle the raw query. The next thing that would be really helpful is to see the sub-queries. Take distrib=false off and tail the logs on all the servers. What we're looking for is whether the sub-queries even make it to Solr, or whether the problem is in your container.

3. If the sub-queries do NOT make it to the Solr logs, what is the query that the container sees? Is it recognizable, or has Solr somehow munged the sub-query?

What is your environment like? Tomcat? Jetty? Other? What JVM, etc.?

Best, Erick

On Tue, Dec 23
Re: 'Illegal character in query' on Solr cloud 4.10.1
Erick, scenario 1 that you listed seems to be the case. When I add distrib=false and query each of the 6 servers, only 1 of them returns (partial) results and the rest give the illegal character error. I have not set up any special logging; I do not see any info in catalina.out, but in a file called localhost_access_log.2014-12-24.txt in the tomcat/logs directory I see the following entry when the invalid character error occurs:

[24/Dec/2014:09:25:54 +] GET /solr/dyCollection1_shard2_replica1/?fl=*,score&q=canon+pixma+printer&sort=score+desc,productNameLength%20asc&wt=json&indent=true&rows=100&defType=edismax&qf=productName&mm=2&pf=productName&ps=1&pf2=productName&pf3=productName&stopwords=true&lowercaseOperators=true&bq=hasThumbnailImage:true^2.0&distrib=false HTTP/1.1 500 7781

I am using Tomcat 7.0.42, SolrCloud 4.10.1, and the Oracle JDK: java version 1.7.0_71, Java(TM) SE Runtime Environment (build 1.7.0_71-b14), Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode). Thanks.

On Tue, Dec 23, 2014 at 11:46 AM, Erick Erickson erickerick...@gmail.com wrote:

Hmmm, so you are pinging the servers directly, right? Here are a couple of things to try:

1. Add distrib=false to the query and try each of the 6 servers. What I'm wondering is whether this happens on the sub-query sent out or on the primary server. Adding distrib=false will execute the query only on the node you send it to, and will NOT send sub-queries out to any other node, so you'll get partial results back. If one server continues to work but the other 5 fail, then your servlet container is probably not set up with the right character sets - although why that would manifest itself on the ^ character mystifies me.

2. Let's assume all 6 servers handle the raw query. The next thing that would be really helpful is to see the sub-queries. Take distrib=false off and tail the logs on all the servers. What we're looking for is whether the sub-queries even make it to Solr, or whether the problem is in your container.

3. If the sub-queries do NOT make it to the Solr logs, what is the query that the container sees? Is it recognizable, or has Solr somehow munged the sub-query?

What is your environment like? Tomcat? Jetty? Other? What JVM, etc.?

Best, Erick

On Tue, Dec 23, 2014 at 3:23 AM, S.L simpleliving...@gmail.com wrote:

Hi All, I am using SolrCloud 4.10.1 with 3 shards and a replication factor of 2, i.e. 6 nodes altogether. When I query server1 of the 6 nodes in the cluster with the query below, it works fine, but querying any other node in the cluster with the same query results in an *HTTP Status 500 - {msg=Illegal character in query at index 181:* error. The character at index 181 is the boost character ^. I have seen a Jira, SOLR-5971 https://issues.apache.org/jira/browse/SOLR-5971, for a similar issue; how can I overcome this? The query I use is below. Thanks in advance!

http://xx2..com:8081/solr/dyCollection1_shard2_replica1/?q=x+x+xx&sort=score+desc&wt=json&indent=true&debugQuery=true&defType=edismax&qf=productName^1.5+productDescription&mm=1&pf=productName+productDescription&ps=1&pf2=productName+productDescription&pf3=productName+productDescription&stopwords=true&lowercaseOperators=true
'Illegal character in query' on Solr cloud 4.10.1
Hi All, I am using SolrCloud 4.10.1 with 3 shards and a replication factor of 2, i.e. 6 nodes altogether. When I query server1 of the 6 nodes in the cluster with the query below, it works fine, but querying any other node in the cluster with the same query results in an *HTTP Status 500 - {msg=Illegal character in query at index 181:* error. The character at index 181 is the boost character ^. I have seen a Jira, SOLR-5971 https://issues.apache.org/jira/browse/SOLR-5971, for a similar issue; how can I overcome this? The query I use is below. Thanks in advance!

http://xx2..com:8081/solr/dyCollection1_shard2_replica1/?q=x+x+xx&sort=score+desc&wt=json&indent=true&debugQuery=true&defType=edismax&qf=productName^1.5+productDescription&mm=1&pf=productName+productDescription&ps=1&pf2=productName+productDescription&pf3=productName+productDescription&stopwords=true&lowercaseOperators=true
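Whatever the container configuration turns out to be, the robust client-side fix for this thread's error is to URL-encode the query string so the circumflex never travels raw. A small sketch in Python (the thread mentions pysolr on the client; standard-library urlencode is shown here, and the parameter values are illustrative):

```python
from urllib.parse import urlencode

# Boost syntax uses '^', which must reach the server as %5E, not raw.
params = {
    "q": "canon pixma printer",
    "defType": "edismax",
    "qf": "productName^1.5 productDescription",
    "bq": "hasThumbnailImage:true^2.0",
}
query_string = urlencode(params)
# query_string contains '%5E1.5' and '%5E2.0' in place of the raw '^'
```

Appending query_string to the /select URL guarantees that no servlet container can reject the request for an illegal character, regardless of how each Tomcat instance is configured.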
Re: Length norm not functioning in solr queries.
Mikhail, thank you for confirming this; however, Ahmet's proposal seems simpler to implement. On Wed, Dec 10, 2014 at 5:07 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

S.L, I briefly skimmed Lucene50NormsConsumer.writeNormsField(); my conclusion is: if you supply your own similarity, which just avoids squeezing the float into a byte in Similarity.computeNorm(FieldInvertState), you get exactly this value back in Similarity.decodeNormValue(long). You may wonder, but this is exactly what's done in PreciseDefaultSimilarity in TestLongNormValueSource. I think you can just use it.

On Wed, Dec 10, 2014 at 12:11 PM, S.L simpleliving...@gmail.com wrote:

Hi Ahmet, is there already an implementation of the suggested workaround? Thanks.

On Tue, Dec 9, 2014 at 6:41 AM, Ahmet Arslan iori...@yahoo.com.invalid wrote:

Hi, The default length norm is not the best option for differentiating very short documents, like product names. Please see: http://find.searchhub.org/document/b3f776512ab640ec#b3f776512ab640ec I suggest you create an additional integer field that holds the number of tokens. You can populate it via an update processor, and then penalize (using function queries) according to that field. This way you have more fine-grained and flexible control over it. Ahmet

On Tuesday, December 9, 2014 12:22 PM, S.L simpleliving...@gmail.com wrote:

Hi Mikhail, thanks. I looked at the explain output, and this is what I see for the two documents in question: they have identical scores even though document 2 has a shorter productName field, and I do not see any lengthNorm-related information in the explain. Also, I am not exactly clear on what needs to be looked at in the API.
*Search Query*: q=iphone+4s+16gb&qf=productName&mm=1&pf=productName&ps=1&pf2=productName&pf3=productName&stopwords=true&lowercaseOperators=true

*productName: Details about Apple iPhone 4s 16GB Smartphone ATT Factory Unlocked*
- *100%* 10.649221 sum of the following:
  - *10.58%* 1.1270299 sum of the following:
    - *2.1%* 0.22383358 productName:iphon
    - *3.47%* 0.36922288 productName:"4 s"
    - *5.01%* 0.53397346 productName:"16 gb"
  - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
  - *27.79%* 2.959255 sum of the following:
    - *10.97%* 1.1680154 productName:"iphon 4 s"~1
    - *16.82%* 1.7912396 productName:"4 s 16 gb"~1
  - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1

*productName: Apple iPhone 4S 16GB for Net10, No Contract, White*
- *100%* 10.649221 sum of the following:
  - *10.58%* 1.1270299 sum of the following:
    - *2.1%* 0.22383358 productName:iphon
    - *3.47%* 0.36922288 productName:"4 s"
    - *5.01%* 0.53397346 productName:"16 gb"
  - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
  - *27.79%* 2.959255 sum of the following:
    - *10.97%* 1.1680154 productName:"iphon 4 s"~1
    - *16.82%* 1.7912396 productName:"4 s 16 gb"~1
  - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1

On Mon, Dec 8, 2014 at 10:25 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

It's worth looking into explain to check the particular scoring values. But the prime suspect is the loss of precision when float norms are stored as byte values. See the javadoc for DefaultSimilarity.encodeNormValue(float).

On Mon, Dec 8, 2014 at 5:49 PM, S.L simpleliving...@gmail.com wrote:

I have two documents, doc1 and doc2, and each has a field called phoneName.

doc1: phoneName: Details about Apple iPhone 4s - 16GB - White (Verizon) Smartphone Factory Unlocked
doc2: phoneName: Apple iPhone 4S 16GB for Net10, No Contract, White

Here, if I search for q=iphone+4s+16gb&qf=phoneName&mm=1&pf=phoneName&ps=1&pf2=phoneName&pf3=phoneName&stopwords=true&lowercaseOperators=true, doc1 and doc2 both get the same score. Since the phoneName field in doc2 is shorter, I would expect it to score higher, but both have an identical score of 9.961212. The phoneName field is defined as follows. As you can see, nowhere am I specifying omitNorms=true, yet the behavior seems to be that the length norm is not functioning at all. Can someone let me know what the issue is here?

<field name="phoneName" type="text_en_splitting" indexed="true" stored="true" required="true" />
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- in this example, we will only use
Re: Length norm not functioning in solr queries.
Ahmet, thank you. As the configurations in SolrCloud are uploaded to ZooKeeper, are there any special steps that need to be taken to make this work in SolrCloud?

On Wed, Dec 10, 2014 at 4:32 AM, Ahmet Arslan iori...@yahoo.com.invalid wrote:

Hi, Or even better, you can use your new field for tie-break purposes where scores are identical, e.g. sort=score desc, wordCount asc. Ahmet

On Wednesday, December 10, 2014 11:29 AM, Ahmet Arslan iori...@yahoo.com wrote:

Hi, You mean an update processor factory? Here is an augmented (wordCount field added) version of your example:

doc1: phoneName: Details about Apple iPhone 4s - 16GB - White (Verizon) Smartphone Factory Unlocked, wordCount: 11
doc2: phoneName: Apple iPhone 4S 16GB for Net10, No Contract, White, wordCount: 9

The first task is simply to calculate the wordCount values. You can do it in your indexing code, or elsewhere. I quickly skimmed the existing update processors but couldn't find a stock implementation. CountFieldValuesUpdateProcessorFactory fooled me, but it looks like it is all about multivalued fields. I guess a simple javascript that splits on whitespace and returns the produced array's size would do the trick: StatelessScriptUpdateProcessorFactory. At this point you have an int field named wordCount; boost=div(1,wordCount) should work, or you can come up with a more sophisticated formula. Ahmet

On Wednesday, December 10, 2014 11:12 AM, S.L simpleliving...@gmail.com wrote:

Hi Ahmet, is there already an implementation of the suggested workaround? Thanks.

On Tue, Dec 9, 2014 at 6:41 AM, Ahmet Arslan iori...@yahoo.com.invalid wrote:

Hi, The default length norm is not the best option for differentiating very short documents, like product names. Please see: http://find.searchhub.org/document/b3f776512ab640ec#b3f776512ab640ec I suggest you create an additional integer field that holds the number of tokens. You can populate it via an update processor, and then penalize (using function queries) according to that field.
This way you have more fine-grained and flexible control over it. Ahmet

On Tuesday, December 9, 2014 12:22 PM, S.L simpleliving...@gmail.com wrote:

Hi Mikhail, thanks. I looked at the explain output, and this is what I see for the two documents in question: they have identical scores even though document 2 has a shorter productName field, and I do not see any lengthNorm-related information in the explain. Also, I am not exactly clear on what needs to be looked at in the API.

*Search Query*: q=iphone+4s+16gb&qf=productName&mm=1&pf=productName&ps=1&pf2=productName&pf3=productName&stopwords=true&lowercaseOperators=true

*productName: Details about Apple iPhone 4s 16GB Smartphone ATT Factory Unlocked*
- *100%* 10.649221 sum of the following:
  - *10.58%* 1.1270299 sum of the following:
    - *2.1%* 0.22383358 productName:iphon
    - *3.47%* 0.36922288 productName:"4 s"
    - *5.01%* 0.53397346 productName:"16 gb"
  - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
  - *27.79%* 2.959255 sum of the following:
    - *10.97%* 1.1680154 productName:"iphon 4 s"~1
    - *16.82%* 1.7912396 productName:"4 s 16 gb"~1
  - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1

*productName: Apple iPhone 4S 16GB for Net10, No Contract, White*
- *100%* 10.649221 sum of the following:
  - *10.58%* 1.1270299 sum of the following:
    - *2.1%* 0.22383358 productName:iphon
    - *3.47%* 0.36922288 productName:"4 s"
    - *5.01%* 0.53397346 productName:"16 gb"
  - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1
  - *27.79%* 2.959255 sum of the following:
    - *10.97%* 1.1680154 productName:"iphon 4 s"~1
    - *16.82%* 1.7912396 productName:"4 s 16 gb"~1
  - *30.81%* 3.2814684 productName:"iphon 4 s 16 gb"~1

On Mon, Dec 8, 2014 at 10:25 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

It's worth looking into explain to check the particular scoring values. But the prime suspect is the loss of precision when float norms are stored as byte values. See the javadoc for DefaultSimilarity.encodeNormValue(float).

On Mon, Dec 8, 2014 at 5:49 PM, S.L simpleliving...@gmail.com wrote:

I have two documents, doc1 and doc2, and each has a field called phoneName. doc1: phoneName: Details about Apple iPhone 4s - 16GB - White (Verizon) Smartphone Factory Unlocked doc2: phoneName: Apple iPhone 4S 16GB for Net10, No Contract, White Here, if I search for q=iphone+4s+16gb&qf=phoneName&mm=1&pf=phoneName&ps=1&pf2=phoneName&pf3=phoneName&stopwords=true&lowercaseOperators=true, doc1 and doc2 both get the same score. Since the phoneName field in doc2 is shorter, I would expect it to score higher, but both have an identical
Re: Length norm not functioning in solr queries.
Yes, I understand that reindexing is necessary; however, for some reason I was not able to invoke the JS script from the update processor, so I ended up using a Java-only solution at index time. Thanks.

On Thu, Dec 11, 2014 at 7:18 AM, Ahmet Arslan iori...@yahoo.com.invalid wrote:

Hi, No special steps need to be taken for a cloud setup. Please note that for both solutions, a re-index is mandatory. Ahmet

On Thursday, December 11, 2014 12:15 PM, S.L simpleliving...@gmail.com wrote:

Ahmet, thank you. As the configurations in SolrCloud are uploaded to ZooKeeper, are there any special steps that need to be taken to make this work in SolrCloud?

On Wed, Dec 10, 2014 at 4:32 AM, Ahmet Arslan iori...@yahoo.com.invalid wrote:

Hi, Or even better, you can use your new field for tie-break purposes where scores are identical, e.g. sort=score desc, wordCount asc. Ahmet

On Wednesday, December 10, 2014 11:29 AM, Ahmet Arslan iori...@yahoo.com wrote:

Hi, You mean an update processor factory? Here is an augmented (wordCount field added) version of your example:

doc1: phoneName: Details about Apple iPhone 4s - 16GB - White (Verizon) Smartphone Factory Unlocked, wordCount: 11
doc2: phoneName: Apple iPhone 4S 16GB for Net10, No Contract, White, wordCount: 9

The first task is simply to calculate the wordCount values. You can do it in your indexing code, or elsewhere. I quickly skimmed the existing update processors but couldn't find a stock implementation. CountFieldValuesUpdateProcessorFactory fooled me, but it looks like it is all about multivalued fields. I guess a simple javascript that splits on whitespace and returns the produced array's size would do the trick: StatelessScriptUpdateProcessorFactory. At this point you have an int field named wordCount; boost=div(1,wordCount) should work, or you can come up with a more sophisticated formula. Ahmet

On Wednesday, December 10, 2014 11:12 AM, S.L simpleliving...@gmail.com wrote:

Hi Ahmet, is there already an implementation of the suggested workaround? Thanks.
On Tue, Dec 9, 2014 at 6:41 AM, Ahmet Arslan iori...@yahoo.com.invalid wrote: Hi, Default length norm is not best option for differentiating very short documents, like product names. Please see : http://find.searchhub.org/document/b3f776512ab640ec#b3f776512ab640ec I suggest you to create an additional integer field, that holds number of tokens. You can populate it via update processor. And then penalise (using fuction queries) according to that field. This way you have more fine grained and flexible control over it. Ahmet On Tuesday, December 9, 2014 12:22 PM, S.L simpleliving...@gmail.com wrote: Hi , Mikhail Thanks , I looked at the explain and this is what I see for the two different documents in questions, they have identical scores even though the document 2 has a shorter productName field, I do not see any lenghtNorm related information in the explain. Also I am not exactly clear on what needs to be looked in the API ? *Search Query* : q=iphone+4s+16gbqf= productNamemm=1pf= productNameps=1pf2= productNamepf3= productNamestopwords=truelowercaseOperators=true *productName Details about Apple iPhone 4s 16GB Smartphone ATT Factory Unlocked * - *100%* 10.649221 sum of the following: - *10.58%* 1.1270299 sum of the following: - *2.1%* 0.22383358 productName:iphon - *3.47%* 0.36922288 productName:4 s - *5.01%* 0.53397346 productName:16 gb - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1 - *27.79%* 2.959255 sum of the following: - *10.97%* 1.1680154 productName:iphon 4 s~1 - *16.82%* 1.7912396 productName:4 s 16 gb~1 - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1 *productName Apple iPhone 4S 16GB for Net10, No Contract, White* - *100%* 10.649221 sum of the following: - *10.58%* 1.1270299 sum of the following: - *2.1%* 0.22383358 productName:iphon - *3.47%* 0.36922288 productName:4 s - *5.01%* 0.53397346 productName:16 gb - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1 - *27.79%* 2.959255 sum of the following: - *10.97%* 1.1680154 productName:iphon 4 s~1 - 
*16.82%* 1.7912396 productName:4 s 16 gb~1 - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1 On Mon, Dec 8, 2014 at 10:25 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: It's worth looking into explain to check the particular scoring values. But the prime suspect is the loss of precision when float norms are stored as byte values. See the javadoc for DefaultSimilarity.encodeNormValue(float) On Mon, Dec 8, 2014 at 5:49 PM, S.L simpleliving...@gmail.com wrote
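Mikhail's point about encodeNormValue can be seen directly: DefaultSimilarity stores each norm in a single byte via Lucene's SmallFloat.floatToByte315, so nearby field lengths can collapse to the same stored value. A small self-contained sketch (the method body mirrors Lucene's SmallFloat; treat it as illustrative):

```java
public class NormPrecision {
    // Mirrors Lucene's SmallFloat.floatToByte315: a float compressed into
    // one byte with 3 mantissa bits and a zero-exponent point of 15.
    public static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    public static void main(String[] args) {
        // The default lengthNorm is roughly 1/sqrt(numTerms); fields of
        // 9 and 10 terms round to the same byte, so their fieldNorms
        // (and hence their scores) come out identical.
        byte nine = floatToByte315((float) (1.0 / Math.sqrt(9)));
        byte ten = floatToByte315((float) (1.0 / Math.sqrt(10)));
        System.out.println(nine == ten); // true: the precision was lost
    }
}
```

This is why two product names of slightly different lengths can show the exact same score in explain, with no visible lengthNorm difference.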
Re: Length norm not functioning in solr queries.
Hi Ahmet, is there already an implementation of the suggested workaround? Thanks. On Tue, Dec 9, 2014 at 6:41 AM, Ahmet Arslan iori...@yahoo.com.invalid wrote: Hi, The default length norm is not the best option for differentiating very short documents, like product names. Please see: http://find.searchhub.org/document/b3f776512ab640ec#b3f776512ab640ec I suggest you create an additional integer field that holds the number of tokens. You can populate it via an update processor, and then penalise (using function queries) according to that field. This way you have more fine-grained and flexible control over it. Ahmet On Tuesday, December 9, 2014 12:22 PM, S.L simpleliving...@gmail.com wrote: Hi Mikhail, Thanks, I looked at the explain, and this is what I see for the two documents in question: they have identical scores even though document 2 has a shorter productName field, and I do not see any lengthNorm-related information in the explain. Also, I am not exactly clear on what needs to be looked at in the API?
*Search Query* : q=iphone+4s+16gbqf= productNamemm=1pf= productNameps=1pf2= productNamepf3= productNamestopwords=truelowercaseOperators=true *productName Details about Apple iPhone 4s 16GB Smartphone ATT Factory Unlocked * - *100%* 10.649221 sum of the following: - *10.58%* 1.1270299 sum of the following: - *2.1%* 0.22383358 productName:iphon - *3.47%* 0.36922288 productName:4 s - *5.01%* 0.53397346 productName:16 gb - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1 - *27.79%* 2.959255 sum of the following: - *10.97%* 1.1680154 productName:iphon 4 s~1 - *16.82%* 1.7912396 productName:4 s 16 gb~1 - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1 *productName Apple iPhone 4S 16GB for Net10, No Contract, White* - *100%* 10.649221 sum of the following: - *10.58%* 1.1270299 sum of the following: - *2.1%* 0.22383358 productName:iphon - *3.47%* 0.36922288 productName:4 s - *5.01%* 0.53397346 productName:16 gb - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1 - *27.79%* 2.959255 sum of the following: - *10.97%* 1.1680154 productName:iphon 4 s~1 - *16.82%* 1.7912396 productName:4 s 16 gb~1 - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1 On Mon, Dec 8, 2014 at 10:25 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: It's worth to look into explain to check particular scoring values. But for most suspect is the reducing precision when float norms are stored in byte vals. See javadoc for DefaultSimilarity.encodeNormValue(float) On Mon, Dec 8, 2014 at 5:49 PM, S.L simpleliving...@gmail.com wrote: I have two documents doc1 and doc2 and each one of those has a field called phoneName. 
doc1: phoneName:Details about Apple iPhone 4s - 16GB - White (Verizon) Smartphone Factory Unlocked doc2: phoneName:Apple iPhone 4S 16GB for Net10, No Contract, White Here if I search for q=iphone+4s+16gb&qf=phoneName&mm=1&pf=phoneName&ps=1&pf2=phoneName&pf3=phoneName&stopwords=true&lowercaseOperators=true, Doc1 and Doc2 both get identical scores. Since the phoneName field in doc2 is shorter, I would expect it to have a higher score, but both have an identical score of 9.961212. The phoneName field is defined as follows. As you can see, nowhere am I specifying omitNorms=true, yet the length norm does not seem to be functioning at all. Can someone let me know what the issue is?
<field name="phoneName" type="text_en_splitting" indexed="true" stored="true" required="true" />
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.KeywordMarkerFilterFactory
Re: Length norm not functioning in solr queries.
Hi , Mikhail Thanks , I looked at the explain and this is what I see for the two different documents in questions, they have identical scores even though the document 2 has a shorter productName field, I do not see any lenghtNorm related information in the explain. Also I am not exactly clear on what needs to be looked in the API ? *Search Query* : q=iphone+4s+16gbqf= productNamemm=1pf= productNameps=1pf2= productNamepf3= productNamestopwords=truelowercaseOperators=true *productName Details about Apple iPhone 4s 16GB Smartphone ATT Factory Unlocked * - *100%* 10.649221 sum of the following: - *10.58%* 1.1270299 sum of the following: - *2.1%* 0.22383358 productName:iphon - *3.47%* 0.36922288 productName:4 s - *5.01%* 0.53397346 productName:16 gb - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1 - *27.79%* 2.959255 sum of the following: - *10.97%* 1.1680154 productName:iphon 4 s~1 - *16.82%* 1.7912396 productName:4 s 16 gb~1 - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1 *productName Apple iPhone 4S 16GB for Net10, No Contract, White* - *100%* 10.649221 sum of the following: - *10.58%* 1.1270299 sum of the following: - *2.1%* 0.22383358 productName:iphon - *3.47%* 0.36922288 productName:4 s - *5.01%* 0.53397346 productName:16 gb - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1 - *27.79%* 2.959255 sum of the following: - *10.97%* 1.1680154 productName:iphon 4 s~1 - *16.82%* 1.7912396 productName:4 s 16 gb~1 - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1 On Mon, Dec 8, 2014 at 10:25 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: It's worth to look into explain to check particular scoring values. But for most suspect is the reducing precision when float norms are stored in byte vals. See javadoc for DefaultSimilarity.encodeNormValue(float) On Mon, Dec 8, 2014 at 5:49 PM, S.L simpleliving...@gmail.com wrote: I have two documents doc1 and doc2 and each one of those has a field called phoneName. 
doc1: phoneName:Details about Apple iPhone 4s - 16GB - White (Verizon) Smartphone Factory Unlocked doc2: phoneName:Apple iPhone 4S 16GB for Net10, No Contract, White Here if I search for q=iphone+4s+16gb&qf=phoneName&mm=1&pf=phoneName&ps=1&pf2=phoneName&pf3=phoneName&stopwords=true&lowercaseOperators=true, Doc1 and Doc2 both get identical scores. Since the phoneName field in doc2 is shorter, I would expect it to have a higher score, but both have an identical score of 9.961212. The phoneName field is defined as follows. As you can see, nowhere am I specifying omitNorms=true, yet the length norm does not seem to be functioning at all. Can someone let me know what the issue is?
<field name="phoneName" type="text_en_splitting" indexed="true" stored="true" required="true" />
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter
Length norm not functioning in solr queries.
I have two documents, doc1 and doc2, and each of them has a field called phoneName. doc1: phoneName:Details about Apple iPhone 4s - 16GB - White (Verizon) Smartphone Factory Unlocked doc2: phoneName:Apple iPhone 4S 16GB for Net10, No Contract, White Here if I search for q=iphone+4s+16gb&qf=phoneName&mm=1&pf=phoneName&ps=1&pf2=phoneName&pf3=phoneName&stopwords=true&lowercaseOperators=true, Doc1 and Doc2 both get identical scores. Since the phoneName field in doc2 is shorter, I would expect it to have a higher score, but both have an identical score of 9.961212. The phoneName field is defined as follows. As you can see, nowhere am I specifying omitNorms=true, yet the length norm does not seem to be functioning at all. Can someone let me know what the issue is?
<field name="phoneName" type="text_en_splitting" indexed="true" stored="true" required="true" />
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
</fieldType>
Re: Boosting the score using edismax for a non empty and non indexed field.
Anyone? On Mon, Dec 8, 2014 at 2:45 AM, S.L simpleliving...@gmail.com wrote: Hi All, I have a situation where I need to boost the score of a query if a field (imageURL) in a given document is non-empty. I am using edismax, so I know that using the bq parameter would solve the problem. However, the field imageURL that I am trying to boost on is not indexed (stored=true and indexed=false). Can I use the bq parameter on a non-indexed field, or should I be looking at re-indexing after changing the schema to make it an indexed field? Also, my use case is such that I want the documents that have an imageURL to be boosted so that they appear before the documents that do not, when sorted by score in descending order. The field in question, imageURL, is sometimes present and sometimes not, which is why I am looking at boosting the score of the documents that have it. Thanks, any help and suggestions are much appreciated!
Boosting the score using edismax for a non empty and non indexed field.
Hi All, I have a situation where I need to boost the score of a query if a field (imageURL) in a given document is non-empty. I am using edismax, so I know that using the bq parameter would solve the problem. However, the field imageURL that I am trying to boost on is not indexed (stored=true and indexed=false). Can I use the bq parameter on a non-indexed field, or should I be looking at re-indexing after changing the schema to make it an indexed field? Also, my use case is such that I want the documents that have an imageURL to be boosted so that they appear before the documents that do not, when sorted by score in descending order. The field in question, imageURL, is sometimes present and sometimes not, which is why I am looking at boosting the score of the documents that have it. Thanks, any help and suggestions are much appreciated!
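For reference: bq is a query against the index, so it can only match on indexed terms. Once imageURL is re-indexed with indexed="true", an "is present" boost typically looks like the following (the field type and boost value are illustrative, not from the thread):

```
<!-- schema.xml: the field must be indexed for bq to match on it -->
<field name="imageURL" type="string" indexed="true" stored="true" />
```

and at query time:

```
defType=edismax&q=...&bq=imageURL:[* TO *]^2.0
```

The open range [* TO *] matches any document that has at least one indexed value in the field, which gives those documents the extra additive boost.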
Re: Can we query on _version_field ?
Here is why I want to do this:
1. My unique key is an http URL, doctorURL.
2. If I do a lookup based on the URL, I am bound to face issues with character escaping and the like.
3. To avoid that I was using a UUID for lookups, but in SolrCloud it is generated uniquely per replica, which is not acceptable.
4. Now I see that the mandatory _version_ field has a value that is unique per document, not per replica, so I am exploring using _version_ for lookups only, not necessarily as the unique key. Is that doable?
On Thu, Nov 13, 2014 at 8:58 AM, Erick Erickson erickerick...@gmail.com wrote: Really, I have to ask why you would want to. This is really purely an internal thing. I don't know what practical value there would be in searching on this. Interestingly, I can search _version_:[100 TO *], but specific searches seem to fail. I wonder if there's something wonky going on with searching on large longs here. Feels like an XY problem to me though. Best, Erick On Thu, Nov 13, 2014 at 12:45 AM, S.L simpleliving...@gmail.com wrote: Hi All, We know that the _version_ field is a mandatory field in the SolrCloud schema.xml; it is expected to be of type long, and it also seems to have a unique value within a collection. However, a query of the form http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:148463254894438%29&wt=json does not seem to return any record. Can we query on the _version_ field defined in schema.xml? Thank you.
Re: Can we query on _version_field ?
Erick, 1. _version_ will change on updates, but shouldn't that be OK? My understanding is that an update here means a new document is inserted with the same unique key (docUrl in my case), which effectively replaces the document. This will not be an issue for me, because the initial search results, based on doctorName, would carry the basic doctor data, and when that tile is clicked the detail data would be displayed based on a lookup of the _version_ id. So as long as _version_ changes only on an update, I should be good. Of course there is a possibility of the document being updated between the search results being displayed and the detailed information being requested, but that is unlikely in my case, because people usually request details as soon as the initial search results are displayed. 2. Yes, I have used UUIDUpdateProcessorFactory in the following ways, but none of them solve the issue, especially in SolrCloud.
*Case 1:* *schema.xml*
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
This does not generate the unique id at all.
*Case 2:*
<field name="id" type="uuid" indexed="true" stored="true" required="true" multiValued="false" />
In this case a unique id is generated, but it is unique per replica, and we end up with different ids for the same document in different replicas. In both cases the solrconfig.xml had the following entry:
<updateRequestProcessorChain name="uuid">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
On Thu, Nov 13, 2014 at 11:01 AM, Erick Erickson erickerick...@gmail.com wrote: _version_ will change on updates, I'm pretty sure, so I doubt it's suitable. I _think_ you can use a UUIDUpdateProcessorFactory here. I haven't checked this personally, but the idea here is that the UUID cannot be assigned on the shard. But if you're checking this out: if the UUID is assigned _before_ the doc is sent to the destination shard, it should be fine. Have you checked that out? I'm at a conference, so I can't check it too thoroughly right now... Best, Erick On Thu, Nov 13, 2014 at 10:18 AM, S.L simpleliving...@gmail.com wrote: Here is why I want to do this: 1. My unique key is an http URL, doctorURL. 2. If I do a lookup based on the URL, I am bound to face issues with character escaping and the like. 3. To avoid that I was using a UUID for lookups, but in SolrCloud it is generated uniquely per replica, which is not acceptable. 4. Now I see that the mandatory _version_ field has a value that is unique per document, not per replica, so I am exploring using _version_ for lookups only, not necessarily as the unique key. Is that doable? On Thu, Nov 13, 2014 at 8:58 AM, Erick Erickson erickerick...@gmail.com wrote: Really, I have to ask why you would want to. This is really purely an internal thing. I don't know what practical value there would be in searching on this. Interestingly, I can search _version_:[100 TO *], but specific searches seem to fail. I wonder if there's something wonky going on with searching on large longs here. Feels like an XY problem to me though. Best, Erick On Thu, Nov 13, 2014 at 12:45 AM, S.L simpleliving...@gmail.com wrote: Hi All, We know that the _version_ field is a mandatory field in the SolrCloud schema.xml; it is expected to be of type long, and it also seems to have a unique value within a collection. However, a query of the form http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:148463254894438%29&wt=json does not seem to return any record. Can we query on the _version_ field defined in schema.xml? Thank you.
Re: Can we query on _version_field ?
I am not sure this is a case of the XY problem. I have no control over the URLs to deduce an id from; they come from the WWW. I made the URL the uniqueKey so that the document gets replaced when a new document with that URL comes in. To do the detail lookup I can either use the docURL as-is, or try to generate a unique id field for each document. For the latter option, UUID is not behaving as expected in SolrCloud, and the _version_ field seems to serve the need. On Thu, Nov 13, 2014 at 11:35 AM, Shawn Heisey apa...@elyograg.org wrote: On 11/12/2014 10:45 PM, S.L wrote: We know that the _version_ field is a mandatory field in the SolrCloud schema.xml; it is expected to be of type long, and it also seems to have a unique value within a collection. However, a query of the form http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:148463254894438%29&wt=json does not seem to return any record. Can we query on the _version_ field defined in schema.xml? I've been watching your journey unfold on the mailing list. The whole thing seems like an XY problem. If I'm reading everything correctly, you want to have a unique ID value that can serve as the uniqueKey, as well as a way to quickly look up a single document in Solr. Is there one part of the URL that serves as a unique identifier that doesn't contain special characters? It seems insane that you would not have a unique ID value for every entity in your system that is composed of only regular characters. Assuming that such an ID exists (and is likely used as one piece of that doctorURL that you mentioned) ... if you can extract that ID value into its own field (either in your indexing code or a custom update processor), you could use that for both uniqueKey and single-document lookups. Having that kind of information in your index seems like a generally good idea. Thanks, Shawn
Re: Can we query on _version_field ?
Garth and Erick, I am now successfully able to auto-generate ids using a UUID updateRequestProcessorChain, by giving the id field the type string. Thanks for your help, folks. On Thu, Nov 13, 2014 at 1:31 PM, Garth Grimm garthgr...@averyranchconsulting.com wrote: So it sounds like you’re OK with using the docURL as the unique key for routing in SolrCloud, but you don’t want to use it as a lookup mechanism. If you don’t want to hash it into a second unique field at feed time, and you can’t seem to find any other field that might be unique, and you don’t want to make your own UpdateRequestProcessorChain that would generate a unique field from your unique key (such as by doing an MD5 hash), you might look at the UpdateRequestProcessorChain named “dedupe” in the OOB solrconfig.xml. It’s primarily designed to help dedupe results, but its technique is to concatenate multiple fields together to create a signature that will be unique in some way. So instead of having to find one field in your data that’s unique, you could look for a couple of fields that, if combined, would create a unique field, and configure the “dedupe” processor to handle that. On Nov 13, 2014, at 12:02 PM, S.L simpleliving...@gmail.com wrote: I am not sure this is a case of the XY problem. I have no control over the URLs to deduce an id from; they come from the WWW. I made the URL the uniqueKey so that the document gets replaced when a new document with that URL comes in. To do the detail lookup I can either use the docURL as-is, or try to generate a unique id field for each document. For the latter option, UUID is not behaving as expected in SolrCloud, and the _version_ field seems to serve the need.
On Thu, Nov 13, 2014 at 11:35 AM, Shawn Heisey apa...@elyograg.org wrote: On 11/12/2014 10:45 PM, S.L wrote: We know that the _version_ field is a mandatory field in the SolrCloud schema.xml; it is expected to be of type long, and it also seems to have a unique value within a collection. However, a query of the form http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:148463254894438%29&wt=json does not seem to return any record. Can we query on the _version_ field defined in schema.xml? I've been watching your journey unfold on the mailing list. The whole thing seems like an XY problem. If I'm reading everything correctly, you want to have a unique ID value that can serve as the uniqueKey, as well as a way to quickly look up a single document in Solr. Is there one part of the URL that serves as a unique identifier that doesn't contain special characters? It seems insane that you would not have a unique ID value for every entity in your system that is composed of only regular characters. Assuming that such an ID exists (and is likely used as one piece of that doctorURL that you mentioned) ... if you can extract that ID value into its own field (either in your indexing code or a custom update processor), you could use that for both uniqueKey and single-document lookups. Having that kind of information in your index seems like a generally good idea. Thanks, Shawn
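Garth's MD5 suggestion can also be done client-side before feeding: hash the doctorURL into a stable, escaping-safe id that comes out identical no matter which node receives the document. A minimal sketch (class and input names are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DocId {
    // Derive a deterministic, URL-safe id from an arbitrary URL string.
    public static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // The same input always yields the same id, regardless of replica.
        System.out.println(md5Hex("http://example.com/doctor?id=42"));
    }
}
```

Because the hash is computed before the document reaches Solr, it avoids the per-replica divergence seen with server-side UUID generation; Solr's SignatureUpdateProcessorFactory (the "dedupe" chain Garth mentions) achieves a similar effect server-side.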
Re: Different ids for the same document in different replicas.
Thanks. So the issue here is I already have a <uniqueKey>doctorId</uniqueKey> defined in my schema.xml. If along with that I also want the id field to be automatically generated for each document, do I have to declare it as a uniqueKey as well? I just tried the following setting without the uniqueKey for id, and it only generates blank ids for me.
*schema.xml*
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
*solrconfig.xml*
<updateRequestProcessorChain name="uuid">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
On Tue, Nov 11, 2014 at 7:47 PM, Garth Grimm garthgr...@averyranchconsulting.com wrote: Looking a little deeper, I did find this about UUIDField http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/schema/UUIDField.html NOTE: Configuring a UUIDField instance with a default value of NEW is not advisable for most users when using SolrCloud (and not possible if the UUID value is configured as the unique key field) since the result will be that each replica of each document will get a unique UUID value. Using UUIDUpdateProcessorFactory http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/update/processor/UUIDUpdateProcessorFactory.html to generate UUID values when documents are added is recommended instead.” That might describe the behavior you saw. And the use of UUIDUpdateProcessorFactory to auto-generate IDs seems to be covered well here: http://solr.pl/en/2013/07/08/automatically-generate-document-identifiers-solr-4-x/ Though I’ve not actually tried that process before. On Nov 11, 2014, at 7:39 PM, Garth Grimm garthgr...@averyranchconsulting.com wrote: “uuid” isn’t an out-of-the-box field type that I’m familiar with. Generally, I’d stick with the out-of-the-box advice of the schema.xml file, which includes things like….
<!-- Only remove the id field if you have a very good reason to. While not strictly required, it is highly recommended. A uniqueKey is present in almost all Solr installations. See the uniqueKey declaration below where uniqueKey is set to id. -->
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
and…
<!-- Field to use to determine and enforce document uniqueness. Unless this field is marked with required=false, it will be a required field -->
<uniqueKey>id</uniqueKey>
If you’re creating some key/value pair with uuid as the key as you feed documents in, and you know that the uuid values you’re creating are unique, just change the field name and unique key name from ‘id’ to ‘uuid’. Or change the key name you send in from ‘uuid’ to ‘id’. On Nov 11, 2014, at 7:18 PM, S.L simpleliving...@gmail.com wrote: Hi All, I am seeing interesting behavior on the replicas. I have a single shard and 6 replicas on SolrCloud 4.10.1, and only a small number of documents (~375) replicated across the six replicas. The interesting thing is that the same document has a different id in each one of those replicas. This is causing fq=id:xyz-type queries to fail, depending on which replica the query goes to. I have specified the id field in the following manner in schema.xml; is this the right way to specify an auto-generated id in SolrCloud?
<field name="id" type="uuid" indexed="true" stored="true" required="true" multiValued="false" />
Thanks.
Re: Different ids for the same document in different replicas.
Just tried adding <uniqueKey>id</uniqueKey> while keeping the id type as string; only blank ids are being generated. It looks like the id is auto-generated only if the id field is of type uuid, but in the SolrCloud case that id will be unique per replica. Is there a way to generate a unique id without using the uuid type and without ending up with a per-replica unique id? The uuid type in question is:
<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
On Wed, Nov 12, 2014 at 6:20 PM, S.L simpleliving...@gmail.com wrote: Thanks. So the issue here is I already have a <uniqueKey>doctorId</uniqueKey> defined in my schema.xml. If along with that I also want the id field to be automatically generated for each document, do I have to declare it as a uniqueKey as well? I just tried the following setting without the uniqueKey for id, and it only generates blank ids for me.
*schema.xml*
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
*solrconfig.xml*
<updateRequestProcessorChain name="uuid">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
On Tue, Nov 11, 2014 at 7:47 PM, Garth Grimm garthgr...@averyranchconsulting.com wrote: Looking a little deeper, I did find this about UUIDField http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/schema/UUIDField.html NOTE: Configuring a UUIDField instance with a default value of NEW is not advisable for most users when using SolrCloud (and not possible if the UUID value is configured as the unique key field) since the result will be that each replica of each document will get a unique UUID value. Using UUIDUpdateProcessorFactory http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/update/processor/UUIDUpdateProcessorFactory.html to generate UUID values when documents are added is recommended instead.” That might describe the behavior you saw.
And the use of UUIDUpdateProcessorFactory to auto generate ID’s seems to be covered well here: http://solr.pl/en/2013/07/08/automatically-generate-document-identifiers-solr-4-x/ Though I’ve not actually tried that process before. On Nov 11, 2014, at 7:39 PM, Garth Grimm garthgr...@averyranchconsulting.commailto: garthgr...@averyranchconsulting.com wrote: “uuid” isn’t an out of the box field type that I’m familiar with. Generally, I’d stick with the out of the box advice of the schema.xml file, which includes things like…. !-- Only remove the id field if you have a very good reason to. While not strictly required, it is highly recommended. A uniqueKey is present in almost all Solr installations. See the uniqueKey declaration below where uniqueKey is set to id. -- field name=id type=string indexed=true stored=true required=true multiValued=false / and… !-- Field to use to determine and enforce document uniqueness. Unless this field is marked with required=false, it will be a required field -- uniqueKeyid/uniqueKey If you’re creating some key/value pair with uuid as the key as you feed documents in, and you know that the uuid values you’re creating are unique, just change the field name and unique key name from ‘id’ to ‘uuid’. Or change the key name you send in from ‘uuid’ to ‘id’. On Nov 11, 2014, at 7:18 PM, S.L simpleliving...@gmail.commailto: simpleliving...@gmail.com wrote: Hi All, I am seeing interesting behavior on the replicas , I have a single shard and 6 replicas and on SolrCloud 4.10.1 . I only have a small number of documents ~375 that are replicated across the six replicas . The interesting thing is that the same document has a different id in each one of those replicas . This is causing the fq(id:xyz) type queries to fail, depending on which replica the query goes to. I have specified the id field in the following manner in schema.xml, is it the right way to specifiy an auto generated id in SolrCloud ? 
<field name="id" type="uuid" indexed="true" stored="true" required="true" multiValued="false"/> Thanks.
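Putting the advice in this thread together, for anyone reading the archive: a sketch of a configuration that avoids the per-replica UUID problem, keeping id as a plain string uniqueKey and letting UUIDUpdateProcessorFactory assign the value once before the document is distributed. Note the update.chain wiring; if the chain is never invoked, documents arrive without ids, which matches the blank ids reported above. This is untested against this exact setup, and the handler defaults shown are an assumption:

```xml
<!-- schema.xml: id stays a plain string, declared as the uniqueKey -->
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<uniqueKey>id</uniqueKey>

<!-- solrconfig.xml: fill in id before the document is forwarded to replicas -->
<updateRequestProcessorChain name="uuid">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- attach the chain to the update handler so requests that do not pass an
     explicit update.chain parameter still get an id assigned -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">uuid</str>
  </lst>
</requestHandler>
```

Because the id is assigned once, on the node that first receives the document, every replica stores the same value, unlike a uuid-typed field with a default of NEW, which is evaluated independently on each replica and produces exactly the differing-ids symptom described in this thread.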
Can we query on the _version_ field?
Hi All, We know that _version_ is a mandatory field in the SolrCloud schema.xml; it is expected to be of type long, and it also seems to have a unique value within a collection. However, a query of the form http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:148463254894438%29&wt=json does not seem to return any record. Can we query on the _version_ field in schema.xml? Thank you.
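For reference, in the stock Solr 4.x example schema the _version_ field is declared roughly as below, and since it is indexed it can be used in a q or fq clause. If such a query returns nothing, one likely cause is a stale value: _version_ changes on every update of a document, so a value copied from an old response will no longer match anything.

```xml
<!-- _version_ as shipped in the example schema.xml: indexed, so filterable -->
<field name="_version_" type="long" indexed="true" stored="true"/>
```

For example, fq=_version_:1483135330558148608 with a value taken from a current query response should match the document that currently carries that version.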
Different ids for the same document in different replicas.
Hi All, I am seeing interesting behavior on the replicas. I have a single shard and 6 replicas on SolrCloud 4.10.1, with only a small number of documents (~375) that are replicated across the six replicas. The interesting thing is that the same document has a different id in each one of those replicas. This is causing fq=(id:xyz) type queries to fail, depending on which replica the query goes to. I have specified the id field in the following manner in schema.xml; is this the right way to specify an auto-generated id in SolrCloud? <field name="id" type="uuid" indexed="true" stored="true" required="true" multiValued="false"/> Thanks.
Re: Master Slave set up in Solr Cloud
Resending this as I might not have been clear in my earlier query. I want to use SolrCloud for everything except the replication; is it possible to set up a master-slave configuration using different Solr instances and still be able to use the sharding feature provided by SolrCloud? On Thu, Oct 30, 2014 at 6:18 PM, S.L simpleliving...@gmail.com wrote: Hi All, As I previously reported, due to no overlap in terms of the documents in the SolrCloud replicas of the index shards, I have turned off the replication and basically have three shards with a replication factor of 1. This obviously will not be scalable, because the same core will be indexed and queried at the same time and this is a long-running indexing task. My question is: what options do I have to set up replicas of the single per-shard core outside of the SolrCloud replication-factor mechanism, since that does not seem to work for me? Thanks.
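For what it's worth, outside of SolrCloud the traditional master-slave setup is configured per core through the ReplicationHandler, with sharded queries issued manually via the shards parameter rather than routed by ZooKeeper. A rough sketch follows; host names, core names, and the poll interval are illustrative, and mixing this with an active SolrCloud collection is not a supported combination, so in practice it means running the shards as plain (non-cloud) cores:

```xml
<!-- solrconfig.xml on each shard's master core -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- solrconfig.xml on the corresponding slave core -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master1:8983/solr/shard1</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```

Queries would then fan out across the shard slaves with something like shards=slave1:8983/solr/shard1,slave2:8983/solr/shard2 on each request.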
Re: Missing Records
I am curious: how many shards do you have, and what's the replication factor you are using? On Thu, Oct 30, 2014 at 5:27 PM, AJ Lemke aj.le...@securitylabs.com wrote: Hi All, We have a SOLR cloud instance that has been humming along nicely for months. Last week we started experiencing missing records. Admin DIH example: Fetched: 903,993 (736/s), Skipped: 0, Processed: 903,993 (736/s) A *:* search claims that there are only 903,902; this is the first full index. Subsequent full indexes give the following counts for the *:* search: 903,805 903,665 826,357 All the while the admin returns: Fetched: 903,993 (x/s), Skipped: 0, Processed: 903,993 (x/s) every time. ---records per second is variable I found an item that should be in the index but is not found in a search. Here are the referenced lines of the log file. DEBUG - 2014-10-30 15:10:51.160; org.apache.solr.update.processor.LogUpdateProcessor; PRE_UPDATE add{,id=750041421} {{params(debug=false&optimize=true&indent=true&commit=true&clean=true&wt=json&command=full-import&entity=ads&verbose=false),defaults(config=data-config.xml)}} DEBUG - 2014-10-30 15:10:51.160; org.apache.solr.update.SolrCmdDistributor; sending update to http://192.168.20.57:7574/solr/inventory_shard1_replica2/ retry:0 add{,id=750041421} params:update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica1%2F --- there are 746 lines of log between entries --- DEBUG - 2014-10-30 15:10:51.340; org.apache.http.impl.conn.Wire; [0x2][0xc3][0xe0]params[0xa2][0xe0].update.distrib(TOLEADER[0xe0],distrib.from?[0x17] http://192.168.20.57:8983/solr/inventory_shard1_replica1/[0xe0]delByQ[0x0][0xe0]'docsMap[0xe][0x13][0x10]8[0x8]?[0x80][0x0][0x0][0xe0]#Zip%51106[0xe0]-IsReelCentric[0x2][0xe0](HasPrice[0x1][0xe0]*Make_Lower'ski-doo[0xe0])StateName$Iowa[0xe0]-OriginalModel/Summit
Highmark[0xe0]/VerticalSiteIDs!2[0xe0]-ClassBinaryIDp@[0xe0]#lat(42.48929[0xe0]-SubClassFacet01704|Snowmobiles[0xe0](FuelType%Other[0xe0]2DivisionName_Lower,recreational[0xe0]latlon042.4893,-96.3693[0xe0]*PhotoCount!8[0xe0](HasVideo[0x2][0xe0]ID)750041421[0xe0]Engine [0xe0]*ClassFacet.12|Snowmobiles[0xe0]$Make'Ski-Doo[0xe0]$City*Sioux City[0xe0]#lng*-96.369302[0xe0]-Certification!N[0xe0]0EmotionalTagline0162 Long Track [0xe0]*IsEnhanced[0x1][0xe0]*SubClassID$1704[0xe0](NetPrice$4500[0xe0]1IsInternetSpecial[0x2][0xe0](HasPhoto[0x1][0xe0]/DealerSortOrder!2[0xe0]+Description?VThis Bad boy will pull you through the deepest snow!With the 162 track and 1000cc of power you can fly up any hill!![0xe0],DealerRadius+8046.72[0xe0],Transmission [0xe0]*ModelFacet7Ski-Doo|Summit Highmark[0xe0]/DealerNameFacet9Certified Auto, Inc.|4150[0xe0])StateAbbrIA[0xe0])ClassName+Snowmobiles[0xe0](DealerID$4150[0xe0]AdCode$DX1Q[0xe0]*DealerName4Certified Auto, Inc.[0xe0])Condition$Used[0xe0]/Condition_Lower$used[0xe0]-ExteriorColor+Blue/Yellow[0xe0],DivisionName,Recreational[0xe0]$Trim(1000 SDI[0xe0](SourceID!1[0xe0]0HasAdEnhancement!0[0xe0]'ClassID12[0xe0].FuelType_Lower%other[0xe0]$Year$2005[0xe0]+DealerFacet?[0x8]4150|Certified Auto, Inc.|Sioux City|IA[0xe0],SubClassName+Snowmobiles[0xe0]%Model/Summit Highmark[0xe0])EntryDate42011-11-17T10:46:00Z[0xe0]+StockNumber000105[0xe0]+PriceRebate!0[0xe0]+Model_Lower/summit highmark[\n] What could be the issue and how does one fix this issue? Thanks so much and if more information is needed I have preserved the log files. AJ
Master Slave set up in Solr Cloud
Hi All, As I previously reported, due to no overlap in terms of the documents in the SolrCloud replicas of the index shards, I have turned off the replication and basically have three shards with a replication factor of 1. This obviously will not be scalable, because the same core will be indexed and queried at the same time and this is a long-running indexing task. My question is: what options do I have to set up replicas of the single per-shard core outside of the SolrCloud replication-factor mechanism, since that does not seem to work for me? Thanks.
Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.
Will, I think in one of your other emails (which I am not able to find) you had asked if I was indexing directly from MapReduce jobs. Yes, I am indexing directly from the map task, using SolrJ with a CloudSolrServer initialized with the ZK ensemble URLs. Do I need to use something like MapReduceIndexerTool, which I suppose writes to HDFS and in a subsequent step moves that to the Solr index? If so, why? I don't use any soft commits and autocommit every 15 seconds; the snippet from the configuration can be seen below. <autoSoftCommit> <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime> </autoSoftCommit> <autoCommit> <maxTime>${solr.autoCommit.maxTime:15000}</maxTime> <openSearcher>true</openSearcher> </autoCommit> I looked at the localhost_access.log file; all the GET and POST requests have a sub-second response time. On Tue, Oct 28, 2014 at 2:06 AM, Will Martin wmartin...@gmail.com wrote: The easiest, and coarsest, measure of response time [not service time in a distributed system] can be picked up in your localhost_access.log file. You're using Tomcat, right? Look up AccessLogValve in the docs and server.xml. You can add configuration to report the payload and the time to service the request without touching any code. Queueing theory is what Otis was talking about when he said you've saturated your environment. In AWS people just auto-scale up and don't worry about where the load comes from; it's dumb if it happens more than 2 times. Capacity planning is tough; let's hope it doesn't disappear altogether. G'luck -Original Message- From: S.L [mailto:simpleliving...@gmail.com] Sent: Monday, October 27, 2014 9:25 PM To: solr-user@lucene.apache.org Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch. Good point about ZK logs, I do see the following exceptions intermittently in the ZK log.
2014-10-27 06:54:14,621 [myid:1] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /xxx.xxx.xxx.xxx:56877 which had sessionid 0x34949dbad580029 2014-10-27 07:00:06,697 [myid:1] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /xxx.xxx.xxx.xxx:37336 2014-10-27 07:00:06,725 [myid:1] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish new session at /xxx.xxx.xxx.xxx:37336 2014-10-27 07:00:06,746 [myid:1] - INFO [CommitProcessor:1:ZooKeeperServer@617] - Established session 0x14949db9da40037 with negotiated timeout 1 for client /xxx.xxx.xxx.xxx:37336 2014-10-27 07:01:06,520 [myid:1] - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception EndOfStreamException: Unable to read additional data from client sessionid 0x14949db9da40037, likely client has closed socket at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228) at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208) at java.lang.Thread.run(Thread.java:744) For queuing theory, I don't know of any way to see how fast the requests are being served by SolrCloud, and whether a queue is being maintained if the service rate is slower than the rate of requests from the incoming multiple threads. On Mon, Oct 27, 2014 at 7:09 PM, Will Martin wmartin...@gmail.com wrote: 2 naïve comments, of course. - Queuing theory - Zookeeper logs. From: S.L [mailto:simpleliving...@gmail.com] Sent: Monday, October 27, 2014 1:42 PM To: solr-user@lucene.apache.org Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch. Please find the clusterstate.json attached. Also in this case at least the Shard1 replicas are out of sync, as can be seen below. Shard 1 replica 1 *does not* return a result with distrib=false.
Query: http://server3.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true Result: <response><lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">*:*</str><str name="shards.info">true</str><str name="distrib">false</str><str name="debug">track</str><str name="wt">xml</str><str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst><result name="response" numFound="0" start="0"/><lst name="debug"/></response> Shard1 replica 2 *does* return the result with distrib=false. Query: http://server2.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt
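The commit settings quoted in this thread (hard autoCommit every 15 seconds with openSearcher=true, soft commits disabled) can be written out as the solrconfig.xml fragment below. A commonly recommended variant for heavy indexing is shown: keep the frequent hard commit but with openSearcher=false, and use a longer autoSoftCommit for search visibility, so searcher reopens don't compete with the indexing load. The interval values are illustrative, not a recommendation for this specific cluster:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: flush and fsync the transaction log frequently,
       but do not open a new searcher on each commit -->
  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: controls how quickly new documents become searchable -->
  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:60000}</maxTime>
  </autoSoftCommit>
</updateHandler>
```

With openSearcher=true on a 15-second hard commit, every commit pays the cost of warming and swapping in a new searcher, which can compound under 75 concurrent indexing threads.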
Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.
I'm using Apache Hadoop and Solr; do I need to switch to Cloudera? On Tue, Oct 28, 2014 at 1:27 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: We index directly from mappers using SolrJ. It does work, but you pay the price of having to instantiate all those sockets vs. the way MapReduceIndexerTool works, where you're writing to an EmbeddedSolrServer directly in the Reduce task. You don't *need* to use MapReduceIndexerTool, but it's more efficient, and if you don't, you then have to make sure to appropriately tune your Hadoop implementation to match what your Solr installation is capable of.
Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.
Yeah, I get that not using MapReduceIndexerTool could be more resource intensive, but the way this issue is manifesting, resulting in disjoint SolrCloud replicas, perplexes me. While you were tuning your SolrCloud environment to cater to the Hadoop indexing requirements, did you ever face the issue of disjoint replicas? Is MapReduceIndexerTool Cloudera-distro specific? I am using Apache Solr and Hadoop. Thanks On Tue, Oct 28, 2014 at 1:27 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: We index directly from mappers using SolrJ. It does work, but you pay the price of having to instantiate all those sockets vs. the way MapReduceIndexerTool works, where you're writing to an EmbeddedSolrServer directly in the Reduce task. You don't *need* to use MapReduceIndexerTool, but it's more efficient, and if you don't, you then have to make sure to appropriately tune your Hadoop implementation to match what your Solr installation is capable of.
Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.
Thanks Otis, I have checked the logs, in my case the default catalina.out, and I don't see any OOMs or any other exceptions. What other metrics do you suggest? On Mon, Oct 27, 2014 at 9:26 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, You may simply be overwhelming your cluster-nodes. Have you checked various metrics to see if that is the case? Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & Elasticsearch Support * http://sematext.com/ On Oct 26, 2014, at 9:59 PM, S.L simpleliving...@gmail.com wrote: Folks, I have posted previously about this. I am using SolrCloud 4.10.1 and have a sharded collection with 6 nodes, 3 shards, and a replication factor of 2. I am indexing Solr using a Hadoop job; I have 15 map fetch tasks that can each have up to 5 threads, so the load on the indexing side can get as high as 75 concurrent threads. I am facing an issue where the replicas of a particular shard(s) are consistently getting out of sync. Initially I thought this was because I was using a custom component, but I did a fresh install, removed the custom component, and reindexed using the Hadoop job, and I still see the same behavior. I do not see any exceptions in my catalina.out, like OOM or any other exceptions; I suspect this could be because of the multi-threaded indexing nature of the Hadoop job. I use CloudSolrServer from my Java code to index, and initialize the CloudSolrServer using a 3-node ZK ensemble. Does anyone know of any known issues with highly multi-threaded indexing and SolrCloud? Can someone help? This issue has been slowing things down on my end for a while now. Thanks and much appreciated!
Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.
Markus, I would like to ignore it too, but what's happening is that there is a lot of discrepancy between the replicas; queries like q=*:*&fq=(id:220a8dce-3b31-4d46-8386-da8405595c47) fail depending on which replica the request goes to, because of the huge discrepancy between the replicas. Thank you for confirming that it is a known issue; I was thinking I was the only one facing this due to my setup. On Mon, Oct 27, 2014 at 11:31 AM, Markus Jelsma markus.jel...@openindex.io wrote: It is an ancient issue. One of the major contributors to the issue was resolved some versions ago, but we are still seeing it sometimes too; there is nothing to see in the logs. We ignore it and just reindex.
Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.
One is not smaller than the other, because numDocs is the same for both replicas; essentially they seem to be disjoint sets. Also, manually purging the replicas is not an option, because this is a frequently indexed index and we need everything to be automated. What other options do I have now? 1. Turn off the replication completely in SolrCloud. 2. Use the traditional master-slave replication model. 3. Introduce a replica-aware field in the index, to figure out which replica a request should go to from the client. 4. Try a distribution like Helios to see if it has any different behavior. Just thinking out loud here. On Mon, Oct 27, 2014 at 11:56 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi - if there is a very large discrepancy, you could consider purging the smallest replica; it will then resync from the leader.
Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.
Please find the clusterstate.json attached. Also in this case *at least* the Shard1 replicas are out of sync, as can be seen below. *Shard 1 replica 1 *does not* return a result with distrib=false.* *Query:* http://server3.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true *Result:* <response><lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">*:*</str><str name="shards.info">true</str><str name="distrib">false</str><str name="debug">track</str><str name="wt">xml</str><str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst><result name="response" numFound="0" start="0"/><lst name="debug"/></response> *Shard1 replica 2 *does* return the result with distrib=false.* *Query:* http://server2.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true *Result:* <response><lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">*:*</str><str name="shards.info">true</str><str name="distrib">false</str><str name="debug">track</str><str name="wt">xml</str><str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="thingURL">http://www.xyz.com</str><str name="id">9f4748c0-fe16-4632-b74e-4fee6b80cbf5</str><long name="_version_">1483135330558148608</long></doc></result><lst name="debug"/></response> On Mon, Oct 27, 2014 at 12:19 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: On Mon, Oct 27, 2014 at 9:40 PM, S.L simpleliving...@gmail.com wrote: One is not smaller than the other, because numDocs is the same for both replicas and essentially they seem to be disjoint sets. That is strange. Can we see your clusterstate.json? With that, please also specify the two replicas which are out of sync. Also manually purging the replicas is not an option, because this is a frequently indexed index and we need everything to be automated. What other options do I have now? 1.
Turn of the replication completely in SolrCloud 2. Use traditional Master Slave replication model. 3. Introduce a replica aware field in the index , to figure out which replica the request should go to from the client. 4. Try a distribution like Helios to see if it has any different behavior. Just think out loud here .. On Mon, Oct 27, 2014 at 11:56 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi - if there is a very large discrepancy, you could consider to purge the smallest replica, it will then resync from the leader. -Original message- From:S.L simpleliving...@gmail.com Sent: Monday 27th October 2014 16:41 To: solr-user@lucene.apache.org Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch. Markus, I would like to ignore it too, but whats happening is that the there is a lot of discrepancy between the replicas , queries like q=*:*fq=(id:220a8dce-3b31-4d46-8386-da8405595c47) fail depending on which replica the request goes to, because of huge amount of discrepancy between the replicas. Thank you for confirming that it is a know issue , I was thinking I was the only one facing this due to my set up. On Mon, Oct 27, 2014 at 11:31 AM, Markus Jelsma markus.jel...@openindex.io wrote: It is an ancient issue. One of the major contributors to the issue was resolved some versions ago but we are still seeing it sometimes too, there is nothing to see in the logs. We ignore it and just reindex. -Original message- From:S.L simpleliving...@gmail.com Sent: Monday 27th October 2014 16:25 To: solr-user@lucene.apache.org Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch. Thank Otis, I have checked the logs , in my case the default catalina.out and I dont see any OOMs or , any other exceptions. What others metrics do you suggest ? On Mon, Oct 27, 2014 at 9:26 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, You may simply be overwhelming your cluster-nodes. 
Have you checked various metrics to see if that is the case? Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Oct 26, 2014, at 9:59 PM, S.L simpleliving...@gmail.com wrote: Folks, I have posted previously about this. I am using SolrCloud 4.10.1 and have a sharded collection with 6 nodes, 3 shards and a replication factor of 2. I am indexing Solr using a Hadoop job; I have 15 Map fetch tasks, that can each have up to 5
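[Editor's note] The per-replica diagnostic queries discussed in this thread combine several parameters (q, fq, wt, distrib, debug, shards.info). A minimal sketch of assembling such a URL with correct escaping; the host, port and collection names are the ones quoted in the thread and are otherwise assumptions:

```python
from urllib.parse import urlencode

def replica_query_url(base_url, doc_id):
    """Build a distrib=false diagnostic query against a single core.

    distrib=false restricts the query to the core the URL points at,
    so the two replicas of a shard can be probed and compared directly.
    """
    params = {
        "q": "*:*",
        "fq": "(id:%s)" % doc_id,
        "wt": "xml",
        "distrib": "false",
        "debug": "track",
        "shards.info": "true",
    }
    return base_url.rstrip("/") + "/select/?" + urlencode(params)

url = replica_query_url(
    "http://server3.mydomain.com:8082/solr/dyCollection1",
    "9f4748c0-fe16-4632-b74e-4fee6b80cbf5",
)
print(url)
```

Running the same URL against each base_url of a shard's replicas, then comparing numFound, reproduces the check used later in the thread.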
Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.
Good point about the ZK logs; I do see the following exceptions intermittently in the ZK log.

2014-10-27 06:54:14,621 [myid:1] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /xxx.xxx.xxx.xxx:56877 which had sessionid 0x34949dbad580029
2014-10-27 07:00:06,697 [myid:1] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,725 [myid:1] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish new session at /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,746 [myid:1] - INFO [CommitProcessor:1:ZooKeeperServer@617] - Established session 0x14949db9da40037 with negotiated timeout 1 for client /xxx.xxx.xxx.xxx:37336
2014-10-27 07:01:06,520 [myid:1] - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x14949db9da40037, likely client has closed socket
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
    at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    at java.lang.Thread.run(Thread.java:744)

As for queuing theory, I don't know of any way to see how fast the requests are being served by SolrCloud, and whether a queue builds up if the service rate is slower than the rate of requests from the incoming multiple threads. On Mon, Oct 27, 2014 at 7:09 PM, Will Martin wmartin...@gmail.com wrote: 2 naïve comments, of course. - Queuing theory - Zookeeper logs. From: S.L [mailto:simpleliving...@gmail.com] Sent: Monday, October 27, 2014 1:42 PM To: solr-user@lucene.apache.org Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch. Please find the clusterstate.json attached. Also, in this case at least the Shard1 replicas are out of sync, as can be seen below.
Shard 1 replica 1 *does not* return a result with distrib=false.

Query: http://server3.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true

Result: <response><lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">*:*</str><str name="shards.info">true</str><str name="distrib">false</str><str name="debug">track</str><str name="wt">xml</str><str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst><result name="response" numFound="0" start="0"/><lst name="debug"/></response>

Shard1 replica 2 *does* return the result with distrib=false.

Query: http://server2.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true

Result: <response><lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">*:*</str><str name="shards.info">true</str><str name="distrib">false</str><str name="debug">track</str><str name="wt">xml</str><str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="thingURL">http://www.xyz.com</str><str name="id">9f4748c0-fe16-4632-b74e-4fee6b80cbf5</str><long name="_version_">1483135330558148608</long></doc></result><lst name="debug"/></response>

On Mon, Oct 27, 2014 at 12:19 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: On Mon, Oct 27, 2014 at 9:40 PM, S.L simpleliving...@gmail.com wrote: One is not smaller than the other, because numDocs is the same for both replicas and essentially they seem to be disjoint sets. That is strange. Can we see your clusterstate.json?
With that, please also specify the two replicas which are out of sync. Also, manually purging the replicas is not an option, because this is a frequently indexed collection and we need everything to be automated. What other options do I have now? 1. Turn off replication completely in SolrCloud 2. Use the traditional master/slave replication model 3. Introduce a replica-aware field in the index, to figure out which replica a request should go to from the client 4. Try a distribution like Helios to see if it behaves differently. Just thinking out loud here... On Mon, Oct 27, 2014 at 11:56 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi - if there is a very large discrepancy, you could consider purging the smallest replica; it will then resync from the leader. -Original message- From: S.L simpleliving...@gmail.com Sent: Monday 27th October 2014 16:41
Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.
Folks, I have posted previously about this. I am using SolrCloud 4.10.1 and have a sharded collection with 6 nodes, 3 shards and a replication factor of 2. I am indexing Solr using a Hadoop job; I have 15 Map fetch tasks, each of which can have up to 5 threads, so the load on the indexing side can get as high as 75 concurrent threads. I am facing an issue where the replicas of a particular shard(s) are consistently getting out of sync. Initially I thought this was because I was using a custom component, but I did a fresh install, removed the custom component and reindexed using the Hadoop job, and I still see the same behavior. I do not see any exceptions in my catalina.out, like OOM or any other exceptions; I suspect this could be because of the multi-threaded indexing nature of the Hadoop job. I use CloudSolrServer from my Java code to index, and initialize the CloudSolrServer using a 3-node ZK ensemble. Does anyone know of any known issues with highly multi-threaded indexing and SolrCloud? Can someone help? This issue has been slowing things down on my end for a while now. Thanks and much appreciated!
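[Editor's note] The job above can drive up to 75 concurrent indexing threads at the cluster. A common client-side mitigation is to cap in-flight work with a bounded worker pool. A minimal sketch; send_batch is a hypothetical stand-in for the real indexing call (e.g. CloudSolrServer.add in SolrJ), and the batch sizes are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_INDEXERS = 8  # far below the 75 threads described above

def send_batch(batch):
    # Stand-in for the real indexing call; here we just report the
    # number of documents "accepted" in this batch.
    return len(batch)

def index_all(batches):
    # A bounded pool caps how many batches are in flight at once,
    # no matter how many upstream map tasks produce work.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_INDEXERS) as pool:
        return sum(pool.map(send_batch, batches))

# 15 "map tasks", each producing 5 batches of 10 documents.
batches = [[{"id": f"doc-{t}-{b}-{d}"} for d in range(10)]
           for t in range(15) for b in range(5)]
total = index_all(batches)
print(total)  # 750 documents funneled through at most 8 concurrent workers
```

Throttling the client does not by itself fix replica divergence, but it makes it easier to rule out overload as the cause.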
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
new directory for /opt/solr/home1/dyCollection1_shard2_replica1/data/index.2014101839463
471654573 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller – Starting download to NRTCachingDirectory(MMapDirectory@/opt/solr/home1/dyCollection1_shard2_replica1/data/index.2014101839463 lockFactory=NativeFSLockFactory@/opt/solr/home1/dyCollection1_shard2_replica1/data/index.2014101839463; maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true
471834454 [zkCallback-2-thread-12] INFO org.apache.solr.common.cloud.ZkStateReader – A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 6)
471897454 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller – Total time taken for download : 243 secs
471898551 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller – New index installed. Updating index properties... index=index.2014101839463
471898932 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller – removing old index directory NRTCachingDirectory(MMapDirectory@/opt/solr/home1/dyCollection1_shard2_replica1/data/index lockFactory=NativeFSLockFactory@/opt/solr/home1/dyCollection1_shard2_replica1/data/index; maxCacheMB=48.0 maxMergeSizeMB=4.0)
471898932 [RecoveryThread] INFO org.apache.solr.update.DefaultSolrCoreState – Creating new IndexWriter...
471898934 [RecoveryThread] INFO org.apache.solr.update.DefaultSolrCoreState – Waiting until IndexWriter is unused... core=dyCollection1_shard2_replica1
471898934 [RecoveryThread] INFO org.apache.solr.update.DefaultSolrCoreState – Rollback old IndexWriter...
core=dyCollection1_shard2_replica1
471904192 [RecoveryThread] INFO org.apache.solr.core.SolrCore – New index directory detected: old=/opt/solr/home1/dyCollection1_shard2_replica1/data/index/ new=/opt/solr/home1/dyCollection1_shard2_replica1/data/index.2014101839463
471904907 [RecoveryThread] INFO org.apache.solr.core.SolrCore – SolrDeletionPolicy.onInit: commits: num=1 commit{dir=NRTCachingDirectory(MMapDirectory@/opt/solr/home1/dyCollection1_shard2_replica1/data/index.2014101839463 lockFactory=NativeFSLockFactory@/opt/solr/home1/dyCollection1_shard2_replica1/data/index.2014101839463; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_88t,generation=10685}
471904907 [RecoveryThread] INFO org.apache.solr.core.SolrCore – newest commit generation = 10685

On Fri, Oct 17, 2014 at 1:12 PM, S.L simpleliving...@gmail.com wrote: Shawn, Just wondering if you have any other suggestions on what the next steps would be? Thanks. On Thu, Oct 16, 2014 at 11:12 PM, S.L simpleliving...@gmail.com wrote: Shawn, 1. I will upgrade to the 67 JVM shortly. 2. This is a new collection, as I was facing a similar issue in 4.7 and, based on Erick's recommendation, I updated to 4.10.1 and created a new collection. 3. Yes, I am hitting the replicas of the same shard, and I see the lists are completely non-overlapping. I am using CloudSolrServer to add the documents. 4. I have a 3-physical-node cluster, with each node having 16GB of memory. 5. I also have a custom request handler defined in my solrconfig.xml as below; however, I am not using it and am only using the default select handler. My MyCustomHandler class has been added to the source and included in the build, but is not being used for any requests yet.
<requestHandler name="/mycustomselect" class="solr.MyCustomHandler" startup="lazy">
  <lst name="defaults">
    <str name="df">suggestAggregate</str>
    <str name="spellcheck.dictionary">direct</str>
    <!-- <str name="spellcheck.dictionary">wordbreak</str> -->
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.alternativeTermCount">5</str>
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

5. The clusterstate.json is copied below:

{"dyCollection1":{
  "shards":{
    "shard1":{
      "range":"8000-d554",
      "state":"active",
      "replicas":{
        "core_node3":{
          "state":"active",
          "core":"dyCollection1_shard1_replica1",
          "node_name":"server3.mydomain.com:8082_solr",
          "base_url":"http://server3.mydomain.com:8082/solr"},
        "core_node4":{
          "state":"active",
          "core":"dyCollection1_shard1_replica2",
          "node_name":"server2.mydomain.com:8081_solr",
          "base_url":"http://server2.mydomain.com:8081/solr",
          "leader":"true"}}},
    "shard2":{
      "range":"d555-2aa9",
      "state":"active",
      "replicas":{
        "core_node1"
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Shawn, Just wondering if you have any other suggestions on what the next steps would be? Thanks. On Thu, Oct 16, 2014 at 11:12 PM, S.L simpleliving...@gmail.com wrote: Shawn, 1. I will upgrade to the 67 JVM shortly. 2. This is a new collection, as I was facing a similar issue in 4.7 and, based on Erick's recommendation, I updated to 4.10.1 and created a new collection. 3. Yes, I am hitting the replicas of the same shard, and I see the lists are completely non-overlapping. I am using CloudSolrServer to add the documents. 4. I have a 3-physical-node cluster, with each node having 16GB of memory. 5. I also have a custom request handler defined in my solrconfig.xml as below; however, I am not using it and am only using the default select handler. My MyCustomHandler class has been added to the source and included in the build, but is not being used for any requests yet. <requestHandler name="/mycustomselect" class="solr.MyCustomHandler" startup="lazy"><lst name="defaults"><str name="df">suggestAggregate</str><str name="spellcheck.dictionary">direct</str><!-- <str name="spellcheck.dictionary">wordbreak</str> --><str name="spellcheck">on</str><str name="spellcheck.extendedResults">true</str><str name="spellcheck.count">10</str><str name="spellcheck.alternativeTermCount">5</str><str name="spellcheck.maxResultsForSuggest">5</str><str name="spellcheck.collate">true</str><str name="spellcheck.collateExtendedResults">true</str><str name="spellcheck.maxCollationTries">10</str><str name="spellcheck.maxCollations">5</str></lst><arr name="last-components"><str>spellcheck</str></arr></requestHandler> 5.
The clusterstate.json is copied below:

{"dyCollection1":{
  "shards":{
    "shard1":{
      "range":"8000-d554",
      "state":"active",
      "replicas":{
        "core_node3":{"state":"active","core":"dyCollection1_shard1_replica1","node_name":"server3.mydomain.com:8082_solr","base_url":"http://server3.mydomain.com:8082/solr"},
        "core_node4":{"state":"active","core":"dyCollection1_shard1_replica2","node_name":"server2.mydomain.com:8081_solr","base_url":"http://server2.mydomain.com:8081/solr","leader":"true"}}},
    "shard2":{
      "range":"d555-2aa9",
      "state":"active",
      "replicas":{
        "core_node1":{"state":"active","core":"dyCollection1_shard2_replica1","node_name":"server1.mydomain.com:8081_solr","base_url":"http://server1.mydomain.com:8081/solr","leader":"true"},
        "core_node6":{"state":"active","core":"dyCollection1_shard2_replica2","node_name":"server3.mydomain.com:8081_solr","base_url":"http://server3.mydomain.com:8081/solr"}}},
    "shard3":{
      "range":"2aaa-7fff",
      "state":"active",
      "replicas":{
        "core_node2":{"state":"active","core":"dyCollection1_shard3_replica2","node_name":"server1.mydomain.com:8082_solr","base_url":"http://server1.mydomain.com:8082/solr","leader":"true"},
        "core_node5":{"state":"active","core":"dyCollection1_shard3_replica1","node_name":"server2.mydomain.com:8082_solr","base_url":"http://server2.mydomain.com:8082/solr"}}}},
  "maxShardsPerNode":"1",
  "router":{"name":"compositeId"},
  "replicationFactor":"2",
  "autoAddReplicas":"false"}}

Thanks! On Thu, Oct 16, 2014 at 9:02 PM, Shawn Heisey apa...@elyograg.org wrote: On 10/16/2014 6:27 PM, S.L wrote: 1. Java Version: java version "1.7.0_51" Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) I believe that build 51 is one of those that is known to have bugs related to Lucene. If you can upgrade this to 67, that would be good, but I don't know that it's a pressing matter. It looks like the Oracle JVM, which is good. 2. OS: CentOS Linux release 7.0.1406 (Core) 3. Everything is 64 bit: OS, Java, and CPU. 4. Java Args.
-Djava.io.tmpdir=/opt/tomcat1/temp -Dcatalina.home=/opt/tomcat1 -Dcatalina.base=/opt/tomcat1 -Djava.endorsed.dirs=/opt/tomcat1/endorsed -DzkHost=server1.mydomain.com:2181,server2.mydomain.com:2181,server3.mydomain.com:2181 -DzkClientTimeout=2 -DhostContext=solr -Dport=8081 -Dhost=server1.mydomain.com -Dsolr.solr.home=/opt/solr/home1 -Dfile.encoding=UTF8 -Duser.timezone=UTC -XX:+UseG1GC -XX:MaxPermSize=128m -XX:PermSize=64m -Xmx2048m -Xms128m -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.util.logging.config.file=/opt/tomcat1/conf/logging.properties I would not use the G1 collector myself, but with the heap
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Shawn, Please find the answers to your questions. 1. Java Version: java version "1.7.0_51" Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) 2. OS: CentOS Linux release 7.0.1406 (Core) 3. Everything is 64 bit: OS, Java, and CPU. 4. Java Args: -Djava.io.tmpdir=/opt/tomcat1/temp -Dcatalina.home=/opt/tomcat1 -Dcatalina.base=/opt/tomcat1 -Djava.endorsed.dirs=/opt/tomcat1/endorsed -DzkHost=server1.mydomain.com:2181,server2.mydomain.com:2181,server3.mydomain.com:2181 -DzkClientTimeout=2 -DhostContext=solr -Dport=8081 -Dhost=server1.mydomain.com -Dsolr.solr.home=/opt/solr/home1 -Dfile.encoding=UTF8 -Duser.timezone=UTC -XX:+UseG1GC -XX:MaxPermSize=128m -XX:PermSize=64m -Xmx2048m -Xms128m -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.util.logging.config.file=/opt/tomcat1/conf/logging.properties 5. The Zookeeper ensemble has 3 zookeeper instances, which are external and not embedded. 6. Container: I am using Apache Tomcat version 7.0.42. *Additional Observations:* I queried all docs on both replicas with distrib=false&fl=id&sort=id+asc, then compared the two lists. Eyeballing the first few lines of ids in both lists, even though each list has an equal number of documents (96309 each), the document ids in them seem to be *mutually exclusive*. I did not find even a single common id in those lists (I tried at least 15 manually); it looks to me like the replicas are disjoint sets. Thanks. On Thu, Oct 16, 2014 at 1:41 AM, Shawn Heisey apa...@elyograg.org wrote: On 10/15/2014 10:24 PM, S.L wrote: Yes, I tried those two queries with distrib=false; I get 0 results for the first and 1 result for the second query (i.e. server 3, shard 2, replica 2) consistently. However, if I run the same second query (i.e.
server 3, shard 2, replica 2) with distrib=true, I sometimes get a result and sometimes not. Shouldn't this query always return a result when it's pointing to a core that seems to have that document, regardless of distrib=true or false? Unfortunately I don't see anything in the logs pointing to any particular information. BTW, you asked me to replace the request handler; I use the select request handler, so I cannot replace it with anything else. Is that a problem? If you send the query with distrib=true (which is the default value in SolrCloud), then it treats it just as if you had sent it to /solr/collection instead of /solr/collection_shardN_replicaN, so it's a full distributed query. The distrib=false is required to turn that behavior off and ONLY query the index on the actual core where you sent it. I only said to replace those things as appropriate. Since you are using /select, it's no problem that you left it that way. If I were to assume that you used /select, but you didn't, the URLs as I wrote them might not have worked. As discussed, this means that your replicas are truly out of sync. It's difficult to know what caused it, especially if you can't see anything in the log when you indexed the missing documents. We know you're on Solr 4.10.1. This means that your Java is a 1.7 version, since Java7 is required. Here's where I ask a whole lot of questions about your setup. What is the precise Java version, and which vendor's Java are you using? What operating system is it on? Is everything 64-bit, or is any piece (CPU, OS, Java) 32-bit? On the Solr admin UI dashboard, it lists all parameters used when starting Java, labelled as "Args". Can you include those? Is zookeeper external, or embedded in Solr? Is it a 3-server (or more) ensemble? Are you using the example jetty, or did you provide your own servlet container? We recommend 64-bit Oracle Java, the latest 1.7 version.
OpenJDK (since version 1.7.x) should be pretty safe as well, but IBM's Java should be avoided. IBM does very aggressive runtime optimizations. These can make programs run faster, but they are known to negatively affect Lucene/Solr. Thanks, Shawn
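[Editor's note] The manual check described above — dumping all ids from each replica with distrib=false&fl=id&sort=id+asc and eyeballing the lists — is easy to automate. A minimal sketch of the comparison step; the id lists here are inlined sample data standing in for the output of the two replica queries:

```python
def compare_replicas(ids_a, ids_b):
    """Compare the id sets of two replicas of the same shard.

    Healthy replicas hold identical id sets; out-of-sync replicas
    show up as non-empty only_in_a / only_in_b differences.
    """
    a, b = set(ids_a), set(ids_b)
    return {
        "common": len(a & b),
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
        "disjoint": a.isdisjoint(b),
    }

# Sample data mimicking the symptom reported above: equal counts,
# yet completely non-overlapping id lists.
replica1 = ["doc-01", "doc-02", "doc-03"]
replica2 = ["doc-04", "doc-05", "doc-06"]
report = compare_replicas(replica1, replica2)
print(report["disjoint"], report["common"])  # True 0
```

For the 96309-document lists mentioned above, set operations like these answer definitively whether the replicas share any ids at all, rather than sampling 15 by hand.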
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Shawn , 1. I will upgrade to 67 JVM shortly . 2. This is a new collection as , I was facing a similar issue in 4.7 and based on Erick's recommendation I updated to 4.10.1 and created a new collection. 3. Yes, I am hitting the replicas of the same shard and I see the lists are completely non overlapping.I am using CloudSolrServer to add the documents. 4. I have a 3 physical node cluster , with each having 16GB in memory. 5. I also have a custom request handler defined in my solrconfig.xml as below , however I am not using that and I am only using the default select handler, but my MyCustomHandler class has been been added to the source and included in the build , but not being used for any requests yet. requestHandler name=/mycustomselect class=solr.MyCustomHandler startup=lazy lst name=defaults str name=dfsuggestAggregate/str str name=spellcheck.dictionarydirect/str !--str name=spellcheck.dictionarywordbreak/str-- str name=spellcheckon/str str name=spellcheck.extendedResultstrue/str str name=spellcheck.count10/str str name=spellcheck.alternativeTermCount5/str str name=spellcheck.maxResultsForSuggest5/str str name=spellcheck.collatetrue/str str name=spellcheck.collateExtendedResultstrue/str str name=spellcheck.maxCollationTries10/str str name=spellcheck.maxCollations5/str /lst arr name=last-components strspellcheck/str /arr /requestHandler 5. 
The clusterstate.json is copied below {dyCollection1:{ shards:{ shard1:{ range:8000-d554, state:active, replicas:{ core_node3:{ state:active, core:dyCollection1_shard1_replica1, node_name:server3.mydomain.com:8082_solr, base_url:http://server3.mydomain.com:8082/solr}, core_node4:{ state:active, core:dyCollection1_shard1_replica2, node_name:server2.mydomain.com:8081_solr, base_url:http://server2.mydomain.com:8081/solr;, leader:true}}}, shard2:{ range:d555-2aa9, state:active, replicas:{ core_node1:{ state:active, core:dyCollection1_shard2_replica1, node_name:server1.mydomain.com:8081_solr, base_url:http://server1.mydomain.com:8081/solr;, leader:true}, core_node6:{ state:active, core:dyCollection1_shard2_replica2, node_name:server3.mydomain.com:8081_solr, base_url:http://server3.mydomain.com:8081/solr}}}, shard3:{ range:2aaa-7fff, state:active, replicas:{ core_node2:{ state:active, core:dyCollection1_shard3_replica2, node_name:server1.mydomain.com:8082_solr, base_url:http://server1.mydomain.com:8082/solr;, leader:true}, core_node5:{ state:active, core:dyCollection1_shard3_replica1, node_name:server2.mydomain.com:8082_solr, base_url:http://server2.mydomain.com:8082/solr, maxShardsPerNode:1, router:{name:compositeId}, replicationFactor:2, autoAddReplicas:false}} Thanks! On Thu, Oct 16, 2014 at 9:02 PM, Shawn Heisey apa...@elyograg.org wrote: On 10/16/2014 6:27 PM, S.L wrote: 1. Java Version :java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) I believe that build 51 is one of those that is known to have bugs related to Lucene. If you can upgrade this to 67, that would be good, but I don't know that it's a pressing matter. It looks like the Oracle JVM, which is good. 2.OS CentOS Linux release 7.0.1406 (Core) 3. Everything is 64 bit , OS , Java , and CPU. 4. Java Args. 
-Djava.io.tmpdir=/opt/tomcat1/temp -Dcatalina.home=/opt/tomcat1 -Dcatalina.base=/opt/tomcat1 -Djava.endorsed.dirs=/opt/tomcat1/endorsed -DzkHost=server1.mydomain.com:2181,server2.mydomain.com:2181, server3.mydomain.com:2181 -DzkClientTimeout=2 -DhostContext=solr -Dport=8081 -Dhost=server1.mydomain.com -Dsolr.solr.home=/opt/solr/home1 -Dfile.encoding=UTF8 -Duser.timezone=UTC -XX:+UseG1GC -XX:MaxPermSize=128m -XX:PermSize=64m -Xmx2048m -Xms128m -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.util.logging.config.file=/opt/tomcat1/conf/logging.properties I would not use the G1 collector myself, but with the heap at only 2GB, I don't know that it matters all that much. Even a worst-case collection probably is not going to take more than a few seconds, and you've already increased the zookeeper client timeout. http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning 5
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
.mydomain.com:8081/solr/dyCollection1_shard2_replica2/ str name=QTime14/str str name=ElapsedTime17/str str name=RequestPurposeGET_TOP_IDS/str str name=NumFound1/str str name=Response{responseHeader={status=0,QTime=14,params={spellcheck=true,spellcheck.maxCollationTries=10,distrib=false,debug=[false, track],version=2,NOW=1413398738457,shard.url= http://server1.mydomain.com:8081/solr/dyCollection1_shard2_replica1/|http://server3.mydomain.com:8081/solr/dyCollection1_shard2_replica2/,df=suggestAggregate,fl=thingURL,score,debugQuery=false,spellcheck.count=10,fq=(id:e8995da8-7d98-4010-93b4-8ff7dffb8bfb),fsv=true,spellcheck.alternativeTermCount=5,spellcheck.maxResultsForSuggest=5,spellcheck.collateExtendedResults=true,spellcheck.extendedResults=true,spellcheck.maxCollations=5,wt=javabin,spellcheck.collate=true,requestPurpose=GET_TOP_IDS,rows=10,rid=server3.mydomain.com-dyCollection1_shard2_replica2-1413398738457-16,start=0,q=*:*,shards.info=true,spellcheck.dictionary=[direct, wordbreak],isShard=true}},response={numFound=1,start=0,maxScore=1.0,docs=[SolrDocument{thingURL= http://www.redacted.com/ip/Cutter-Bite-MD-Insect-Bite-Relief-.5-fl-oz/12166875, score=1.0}]},sort_values={},debug={}}/str /lst lst name= http://server1.mydomain.com:8082/solr/dyCollection1_shard3_replica2/|http://server2.mydomain.com:8082/solr/dyCollection1_shard3_replica1/ str name=QTime26/str str name=ElapsedTime29/str str name=RequestPurposeGET_TOP_IDS/str str name=NumFound0/str str name=Response{responseHeader={status=0,QTime=26,params={spellcheck=true,spellcheck.maxCollationTries=10,distrib=false,debug=[false, track],version=2,NOW=1413398738457,shard.url= 
http://server1.mydomain.com:8082/solr/dyCollection1_shard3_replica2/|http://server2.mydomain.com:8082/solr/dyCollection1_shard3_replica1/,df=suggestAggregate,fl=thingURL,score,debugQuery=false,spellcheck.count=10,fq=(id:e8995da8-7d98-4010-93b4-8ff7dffb8bfb),fsv=true,spellcheck.alternativeTermCount=5,spellcheck.maxResultsForSuggest=5,spellcheck.collateExtendedResults=true,spellcheck.extendedResults=true,spellcheck.maxCollations=5,wt=javabin,spellcheck.collate=true,requestPurpose=GET_TOP_IDS,rows=10,rid=server3.mydomain.com-dyCollection1_shard2_replica2-1413398738457-16,start=0,q=*:*,shards.info=true,spellcheck.dictionary=[direct, wordbreak],isShard=true}},response={numFound=0,start=0,maxScore=0.0,docs=[]},sort_values={},debug={}}/str /lst /lst lst name=GET_FIELDS lst name= http://server1.mydomain.com:8081/solr/dyCollection1_shard2_replica1/|http://server3.mydomain.com:8081/solr/dyCollection1_shard2_replica2/ str name=QTime1/str str name=ElapsedTime3/str str name=RequestPurposeGET_FIELDS,GET_DEBUG/str str name=NumFound1/str str name=Response{responseHeader={status=0,QTime=1,params={spellcheck=false,spellcheck.maxCollationTries=10,distrib=false,debug=[track, track],version=2,df=suggestAggregate,shard.url= http://server1.mydomain.com:8081/solr/dyCollection1_shard2_replica1/|http://server3.mydomain.com:8081/solr/dyCollection1_shard2_replica2/,NOW=1413398738457,spellcheck.count=10,fq=(id:e8995da8-7d98-4010-93b4-8ff7dffb8bfb),spellcheck.alternativeTermCount=5,spellcheck.maxResultsForSuggest=5,spellcheck.collateExtendedResults=true,spellcheck.extendedResults=true,spellcheck.maxCollations=5,ids=http://www.redacted.com/ip/Cutter-Bite,spellcheck.collate=true,wt=javabin,requestPurpose=GET_FIELDS,GET_DEBUG,rows=10,rid=server3.mydomain.com-dyCollection1_shard2_replica2-1413398738457-16,q=*:*,shards.info=true,spellcheck.dictionary=[direct, wordbreak],isShard=true}},response={numFound=1,start=0,docs=[SolrDocument{thingURL= http://www.redacted.com/ip/Cutter-Bite, 
id=e8995da8-7d98-4010-93b4-8ff7dffb8bfb, _version_=1481991045188157440}]},debug={}}/str /lst /lst /lst /lst /response On Tue, Oct 14, 2014 at 10:32 AM, Tim Potter tim.pot...@lucidworks.com wrote: Try adding shards.info=true and debug=track to your queries ... these will give more detailed information about what's going on behind the scenes. On Mon, Oct 13, 2014 at 11:11 PM, S.L simpleliving...@gmail.com wrote: Erick, I have upgraded to SolrCloud 4.10.1 with the same topology: 3 shards and a replication factor of 2, with six cores altogether. Unfortunately, I still see the issue of intermittently no results being returned. I am not able to figure out what's going on here; I have included the logging information below. *Here's the query that I run.* http://server1.mydomain.com:8081/solr/dyCollection1/select/?q=*:*&fq=%28id:220a8dce-3b31-4d46-8386-da8405595c47%29&wt=json&distrib=true *Scenario 1: No result returned.* *Log Information for Scenario #1.* 92860314 [http-bio-8081-exec-103] INFO
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Look at the logging information I provided below; it looks like results are only being returned for this SolrCloud cluster if the request goes to one of the two replicas of a shard. I have verified that numDocs in the replicas for a given shard is the same, but there is a difference in maxDoc and deletedDocs; does this signal the replicas being out of sync? Even if the numDocs are the same, how do we guarantee that those docs are identical and have the same uniqueKeys? Is there a way to verify this? I suspect that, since numDocs is the same across the replicas and yet I only get a result back when the request goes to one of the replicas of the shard, the documents within those replicas within a shard are not an exact replica set of each other. I suspect the issue I am facing in the 4.10.1 cloud is related to https://issues.apache.org/jira/browse/SOLR-4924. Can anyone please let me know how to solve this issue of intermittent no results for a query? On Wed, Oct 15, 2014 at 3:15 PM, S.L simpleliving...@gmail.com wrote: Tim, Thanks for the suggestion. I have rerun the query by adding shards.info=true and debug=track. I have included the XML data for both scenarios below. This happens intermittently on SolrCloud 4.10.1, with a replication factor of 2 and 3 shards (6 cores); I get a result in one execution of the query and then no results for the subsequent one. I am hoping someone will be able to help me find the root cause with this additional information; I have included the query output with the additional parameters for both scenarios below. Thanks for your help! *Scenario #1: In this try I get no results back. Here is what the query returns.* <?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">29</int>
    <lst name="params">
      <str name="q">*:*</str>
      <str name="shards.info">true</str>
      <str name="distrib">true</str>
      <str name="debug">track</str>
      <str name="wt">xml</str>
      <str name="fq">(id:e8995da8-7d98-4010-93b4-8ff7dffb8bfb)</str>
    </lst>
  </lst>
  <lst name="shards.info">
    <lst name="http://server1.mydomain.com:8082/solr/dyCollection1_shard3_replica2/|http://server2.mydomain.com:8082/solr/dyCollection1_shard3_replica1/">
      <long name="numFound">0</long>
      <float name="maxScore">0.0</float>
      <str name="shardAddress">http://server2.mydomain.com:8082/solr/dyCollection1_shard3_replica1</str>
      <long name="time">4</long>
    </lst>
    <lst name="http://server3.mydomain.com:8082/solr/dyCollection1_shard1_replica1/|http://server2.mydomain.com:8081/solr/dyCollection1_shard1_replica2/">
      <long name="numFound">0</long>
      <float name="maxScore">0.0</float>
      <str name="shardAddress">http://server3.mydomain.com:8082/solr/dyCollection1_shard1_replica1</str>
      <long name="time">13</long>
    </lst>
    <lst name="http://server1.mydomain.com:8081/solr/dyCollection1_shard2_replica1/|http://server3.mydomain.com:8081/solr/dyCollection1_shard2_replica2/">
      <long name="numFound">0</long>
      <float name="maxScore">0.0</float>
      <str name="shardAddress">http://server1.mydomain.com:8081/solr/dyCollection1_shard2_replica1</str>
      <long name="time">26</long>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0" maxScore="0.0"/>
  <lst name="spellcheck">
    <lst name="suggestions">
      <bool name="correctlySpelled">false</bool>
    </lst>
  </lst>
  <lst name="debug">
    <lst name="track">
      <str name="rid">server3.mydomain.com-dyCollection1_shard2_replica2-1413398784226-17</str>
      <lst name="EXECUTE_QUERY">
        <lst name="http://server1.mydomain.com:8082/solr/dyCollection1_shard3_replica2/|http://server2.mydomain.com:8082/solr/dyCollection1_shard3_replica1/">
          <str name="QTime">1</str>
          <str name="ElapsedTime">4</str>
          <str name="RequestPurpose">GET_TOP_IDS</str>
          <str name="NumFound">0</str>
          <str name="Response">{responseHeader={status=0,QTime=1,params={spellcheck=true,spellcheck.maxCollationTries=10,distrib=false,debug=[false, track],version=2,NOW=1413398784225,shard.url=
http://server1.mydomain.com:8082/solr/dyCollection1_shard3_replica2/|http://server2.mydomain.com:8082/solr/dyCollection1_shard3_replica1/,df=suggestAggregate,fl=thingURL,score,debugQuery=false,spellcheck.count=10,fq=(id:e8995da8-7d98-4010-93b4-8ff7dffb8bfb),fsv=true,spellcheck.alternativeTermCount=5,spellcheck.maxResultsForSuggest=5,spellcheck.collateExtendedResults=true,spellcheck.extendedResults=true,spellcheck.maxCollations=5,wt=javabin,spellcheck.collate=true,requestPurpose=GET_TOP_IDS,rows=10,rid=server3.mydomain.com-dyCollection1_shard2_replica2-1413398784226-17,start=0,q=*:*,shards.info=true,spellcheck.dictionary=[direct, wordbreak],isShard=true}},response={numFound=0,start=0,maxScore=0.0,docs=[]},sort_values={},debug={}}/str /lst lst name= http://server3
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Shawn, Yes, I tried those two queries with distrib=false; I consistently get 0 results for the first and 1 result for the second query (i.e. server 3, shard 2, replica 2). However, if I run that same second query (server 3, shard 2, replica 2) with distrib=true, I sometimes get a result and sometimes not. Shouldn't this query always return a result when it is pointed at a core that seems to have that document, regardless of distrib=true or false? Unfortunately I don't see anything in particular in the logs that points to any useful information. BTW, you asked me to replace the request handler; I use the select request handler, so I cannot replace it with anything else. Is that a problem? Thanks. On Thu, Oct 16, 2014 at 12:05 AM, Shawn Heisey apa...@elyograg.org wrote: On 10/15/2014 9:26 PM, S.L wrote: Look at the logging information I provided below; it looks like results are only returned from this SolrCloud cluster if the request goes to one of the two replicas of a shard. I have verified that numDocs is the same across the replicas of a given shard, but maxDoc and deletedDocs differ. Does this signal that the replicas are out of sync? Even if numDocs is the same, how do we guarantee that those docs are identical and have the same unique keys; is there a way to verify this? Since numDocs is the same across the replicas, yet I only get a result back when the request goes to one particular replica of the shard, I suspect that the documents within the replicas of a shard are not an exact replica set of each other. I suspect the issue I am facing on the 4.10.1 cloud is related to https://issues.apache.org/jira/browse/SOLR-4924 . Can anyone please let me know how to solve this issue of intermittently getting no results for a query?
query with no results hits these cores: server 2 shard 3 replica 1, server 3 shard 1 replica 1, server 1 shard 2 replica 1. query with 1 result hits these cores: server 2 shard 1 replica 2, server 3 shard 2 replica 2 (found 1), server 1 shard 3 replica 2. Here are some URLs for some testing. They are directed at specific shard replicas and are specifically NOT distributed queries: http://server1.mydomain.com:8081/solr/dyCollection1_shard2_replica1/select?q=*:*&fq=id:e8995da8-7d98-4010-93b4-8ff7dffb8bfb&distrib=false http://server3.mydomain.com:8081/solr/dyCollection1_shard2_replica2/select?q=*:*&fq=id:e8995da8-7d98-4010-93b4-8ff7dffb8bfb&distrib=false If you run these queries (replacing server names and the /select request handler as appropriate), do you get 0 results on the first one and 1 result on the second one? If you do, then you've definitely got replicas out of sync. If you get 1 result on both queries, then something else is breaking. If by chance you have taken steps to fix this particular ID, pick another one that you know has a problem. There is no automated way to detect replicas out of sync. You could request all docs on both replicas with distrib=false&fl=id&sort=id+asc, then compare the two lists. Depending on how many docs you have, those queries could take a while to run. If the replicas are out of sync, are there any ERROR entries in the Solr log, especially at the time that the problem docs were indexed? Thanks, Shawn
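Shawn's manual comparison of the two full id lists can be scripted. Below is a minimal sketch of just the comparison step, in Python; the function name and toy data are illustrative (not from the thread), and the real id lists would be fetched from each replica with select?q=*:*&distrib=false&fl=id&sort=id+asc:

```python
def diff_replica_ids(ids_a, ids_b):
    """Given the full id lists fetched from two replicas of the same shard,
    return the ids missing from replica A and from replica B, respectively."""
    set_a, set_b = set(ids_a), set(ids_b)
    return sorted(set_b - set_a), sorted(set_a - set_b)

# Toy example: replica B has one extra document that replica A is missing.
missing_in_a, missing_in_b = diff_replica_ids(
    ["doc1", "doc2"],
    ["doc1", "doc2", "e8995da8-7d98-4010-93b4-8ff7dffb8bfb"],
)
print(missing_in_a)  # ['e8995da8-7d98-4010-93b4-8ff7dffb8bfb']
print(missing_in_b)  # []
```

If both returned lists are empty, the replicas hold the same set of unique keys.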
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
to track down, you just are lucky perhaps ;)... Erick On Mon, Oct 6, 2014 at 8:04 PM, S.L simpleliving...@gmail.com wrote: Erick, Thanks for the suggestion. I am not sure if I would be able to capture what went wrong, so upgrading to 4.10 seems easier even though it means a day's work of effort :). I will go ahead and upgrade and let you know, although I am surprised that this issue never got reported for 4.7 up until now. Thanks again for your help! On Mon, Oct 6, 2014 at 10:52 PM, Erick Erickson erickerick...@gmail.com wrote: I think there were some holes that would allow replicas and leaders to be out of sync that have been patched up in the last 3 releases. There shouldn't be anything you need to do to keep these in sync, so if you can capture what happened when things got out of sync we'll fix it. But a lot has changed in the last several months, so the first thing I'd do, if possible, is upgrade to 4.10.1. Best, Erick On Mon, Oct 6, 2014 at 2:41 PM, S.L simpleliving...@gmail.com wrote: Hi Erick, Before I tried your suggestion of issuing a commit=true update, I realized that for each shard there was at least one node whose index directory was named like index.timestamp. I went ahead and deleted that index directory, which restarted that core, and now the index directory is synced with the other node and is properly named 'index' without any timestamp attached to it. This now gives me consistent results for distrib=true using a load balancer. Also, distrib=false returns expected results for a given shard. The underlying issue appears to be that in every shard the leader and the replica (follower) were out of sync. How can I avoid this from happening again? Thanks for your help! Sent from my HTC - Reply message - From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org Subject: SolrCloud 4.7 not doing distributed search when querying from a load balancer. Date: Fri, Oct 3, 2014 12:56 AM H.
Assuming that you aren't re-indexing the doc you're searching for... Try issuing http://blah blah:8983/solr/collection/update?commit=true. That'll force all the docs to be searchable. Does 1) still hold for the document in question? Because this is exactly backwards of what I'd expect. I'd expect, if anything, that the replica (I'm trying to call it the follower when a distinction needs to be made, since the leader is a replica too) would be out of sync. This is still a Bad Thing, but the leader gets first crack at indexing things. bq: only the replica of the shard that has this key returns the result, and the leader does not. Just to be sure we're talking about the same thing: when you say leader, you mean the shard leader, right? The filled-in circle on the graph view of the admin/cloud page. And let's see your soft and hard commit settings, please. Best, Erick On Thu, Oct 2, 2014 at 9:48 PM, S.L simpleliving...@gmail.com wrote: Erick, 0) The load balancer is out of the picture. 1) When I query with *distrib=false*, I get consistent results as expected for those shards that don't have the key, i.e. I don't get results back for those shards. However, I just realized that while *distrib=false* is present in the query for the shard that is supposed to contain the key, only the replica of the shard that has this key returns the result, and the leader does not. It looks like the replica and the leader do not have the same data, and the replica seems to contain the key in the query for that shard. 2) By indexing I mean this collection is being populated by a web crawler. So it looks like 1) above points to the leader and replica being out of sync for at least one shard. On Thu, Oct 2, 2014 at 11:57 PM, Erick Erickson erickerick...@gmail.com wrote: bq: Also, the collection is being actively indexed as I query this, could that be an issue too? Not if the documents you're searching for aren't being added as you search (and all your autocommit intervals have expired).
I would turn off indexing for testing; it's just one more variable that can get in the way of understanding this. Do note that if the problem were endemic to Solr, there would probably be a _lot_ more noise out there. So to recap: 0) we can take the load balancer out of the picture altogether. 1) when you query each shard individually with distrib=false, every replica in a particular shard returns the same count. 2) when you query without distrib=false you get varying counts. This is very strange and not at all expected. Let's try it again without indexing going on... And what do you mean by indexing anyway? How are documents being fed to your system? Best, Erick
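For reference, the soft and hard commit settings Erick asks about live in solrconfig.xml's update handler section. An illustrative example only; these interval values are assumptions, not the poster's actual configuration:

```xml
<!-- solrconfig.xml: example values only -->
<autoCommit>
  <maxTime>15000</maxTime>           <!-- hard commit every 15s: flushes to disk -->
  <openSearcher>false</openSearcher> <!-- does not make new docs visible by itself -->
</autoCommit>
<autoSoftCommit>
  <maxTime>5000</maxTime>            <!-- soft commit every 5s: makes new docs searchable -->
</autoSoftCommit>
```

With openSearcher=false on the hard commit, documents only become searchable when a soft commit (or an explicit commit=true) opens a new searcher, which is why an un-expired soft commit interval can make a freshly indexed document appear "missing".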
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Hi Erick, Before I tried your suggestion of issuing a commit=true update, I realized that for each shard there was at least one node whose index directory was named like index.timestamp. I went ahead and deleted that index directory, which restarted that core, and now the index directory is synced with the other node and is properly named 'index' without any timestamp attached to it. This now gives me consistent results for distrib=true using a load balancer. Also, distrib=false returns expected results for a given shard. The underlying issue appears to be that in every shard the leader and the replica (follower) were out of sync. How can I avoid this from happening again? Thanks for your help! Sent from my HTC - Reply message - From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org Subject: SolrCloud 4.7 not doing distributed search when querying from a load balancer. Date: Fri, Oct 3, 2014 12:56 AM H. Assuming that you aren't re-indexing the doc you're searching for... Try issuing http://blah blah:8983/solr/collection/update?commit=true. That'll force all the docs to be searchable. Does 1) still hold for the document in question? Because this is exactly backwards of what I'd expect. I'd expect, if anything, that the replica (I'm trying to call it the follower when a distinction needs to be made, since the leader is a replica too) would be out of sync. This is still a Bad Thing, but the leader gets first crack at indexing things. bq: only the replica of the shard that has this key returns the result, and the leader does not. Just to be sure we're talking about the same thing: when you say leader, you mean the shard leader, right? The filled-in circle on the graph view of the admin/cloud page. And let's see your soft and hard commit settings, please. Best, Erick On Thu, Oct 2, 2014 at 9:48 PM, S.L simpleliving...@gmail.com wrote: Erick, 0) The load balancer is out of the picture.
1) When I query with *distrib=false*, I get consistent results as expected for those shards that don't have the key, i.e. I don't get results back for those shards. However, I just realized that while *distrib=false* is present in the query for the shard that is supposed to contain the key, only the replica of the shard that has this key returns the result, and the leader does not. It looks like the replica and the leader do not have the same data, and the replica seems to contain the key in the query for that shard. 2) By indexing I mean this collection is being populated by a web crawler. So it looks like 1) above points to the leader and replica being out of sync for at least one shard. On Thu, Oct 2, 2014 at 11:57 PM, Erick Erickson erickerick...@gmail.com wrote: bq: Also, the collection is being actively indexed as I query this, could that be an issue too? Not if the documents you're searching for aren't being added as you search (and all your autocommit intervals have expired). I would turn off indexing for testing; it's just one more variable that can get in the way of understanding this.
Best, Erick@PuzzledAsWell On Thu, Oct 2, 2014 at 7:32 PM, S.L simpleliving...@gmail.com wrote: Erick, I would like to add that the interesting behavior, i.e. point #2 that I mentioned in my earlier reply, happens in all the shards. If this were a distributed search issue it should not have manifested itself in the shard that contains the key I am searching for; it looks like the search is just failing as a whole, intermittently. Also, the collection is being actively indexed as I query this; could that be an issue too? Thanks. On Thu, Oct 2, 2014 at 10:24 PM, S.L simpleliving...@gmail.com wrote: Erick, Thanks for your reply, I tried your suggestions. 1) When not using the load balancer, if I *have distrib=false* I get consistent results across the replicas. 2) However, here's the interesting part: while not using the load balancer, if I *don't have distrib=false*, then when I query a particular node I get the same behaviour as if I were using a load balancer, meaning the distributed search from a node works intermittently. Does this give any clue? On Thu, Oct 2, 2014 at 7:47 PM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, nothing quite makes sense here. Here are some experiments: 1) avoid the load balancer and issue queries like
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Erick, Thanks for the suggestion. I am not sure if I would be able to capture what went wrong, so upgrading to 4.10 seems easier even though it means a day's work of effort :). I will go ahead and upgrade and let you know, although I am surprised that this issue never got reported for 4.7 up until now. Thanks again for your help! On Mon, Oct 6, 2014 at 10:52 PM, Erick Erickson erickerick...@gmail.com wrote: I think there were some holes that would allow replicas and leaders to be out of sync that have been patched up in the last 3 releases. There shouldn't be anything you need to do to keep these in sync, so if you can capture what happened when things got out of sync we'll fix it. But a lot has changed in the last several months, so the first thing I'd do, if possible, is upgrade to 4.10.1. Best, Erick On Mon, Oct 6, 2014 at 2:41 PM, S.L simpleliving...@gmail.com wrote: Hi Erick, Before I tried your suggestion of issuing a commit=true update, I realized that for each shard there was at least one node whose index directory was named like index.timestamp. I went ahead and deleted that index directory, which restarted that core, and now the index directory is synced with the other node and is properly named 'index' without any timestamp attached to it. This now gives me consistent results for distrib=true using a load balancer. Also, distrib=false returns expected results for a given shard. The underlying issue appears to be that in every shard the leader and the replica (follower) were out of sync. How can I avoid this from happening again? Thanks for your help! Sent from my HTC - Reply message - From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org Subject: SolrCloud 4.7 not doing distributed search when querying from a load balancer. Date: Fri, Oct 3, 2014 12:56 AM H. Assuming that you aren't re-indexing the doc you're searching for... Try issuing http://blah blah:8983/solr/collection/update?commit=true.
That'll force all the docs to be searchable. Does 1) still hold for the document in question? Because this is exactly backwards of what I'd expect. I'd expect, if anything, that the replica (I'm trying to call it the follower when a distinction needs to be made, since the leader is a replica too) would be out of sync. This is still a Bad Thing, but the leader gets first crack at indexing things. bq: only the replica of the shard that has this key returns the result, and the leader does not. Just to be sure we're talking about the same thing: when you say leader, you mean the shard leader, right? The filled-in circle on the graph view of the admin/cloud page. And let's see your soft and hard commit settings, please. Best, Erick On Thu, Oct 2, 2014 at 9:48 PM, S.L simpleliving...@gmail.com wrote: Erick, 0) The load balancer is out of the picture. 1) When I query with *distrib=false*, I get consistent results as expected for those shards that don't have the key, i.e. I don't get results back for those shards. However, I just realized that while *distrib=false* is present in the query for the shard that is supposed to contain the key, only the replica of the shard that has this key returns the result, and the leader does not. It looks like the replica and the leader do not have the same data, and the replica seems to contain the key in the query for that shard. 2) By indexing I mean this collection is being populated by a web crawler. So it looks like 1) above points to the leader and replica being out of sync for at least one shard. On Thu, Oct 2, 2014 at 11:57 PM, Erick Erickson erickerick...@gmail.com wrote: bq: Also, the collection is being actively indexed as I query this, could that be an issue too? Not if the documents you're searching for aren't being added as you search (and all your autocommit intervals have expired). I would turn off indexing for testing; it's just one more variable that can get in the way of understanding this.
Do note that if the problem were endemic to Solr, there would probably be a _lot_ more noise out there. So to recap: 0) we can take the load balancer out of the picture altogether. 1) when you query each shard individually with distrib=false, every replica in a particular shard returns the same count. 2) when you query without distrib=false you get varying counts. This is very strange and not at all expected. Let's try it again without indexing going on... And what do you mean by indexing anyway? How are documents being fed to your system? Best, Erick@PuzzledAsWell On Thu, Oct 2, 2014 at 7:32 PM, S.L simpleliving...@gmail.com wrote: Erick, I would like to add that the interesting behavior, i.e. point #2 that I mentioned in my earlier reply, happens in all the shards. If this were a distributed search issue
SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Hi All, I am trying to query a 6 node Solr 4.7 cluster with 3 shards and a replication factor of 2. I have fronted these 6 Solr nodes with a load balancer. What I notice is that every time I do a search of the form q=*:*&fq=(id:9e78c064-919f-4ef3-b236-dc66351b4acf), it gives me a result only once in every 3 tries, telling me that the load balancer is distributing the requests between the 3 shards and SolrCloud only returns a result if the request goes to the core that has that id. However, if I do a simple search like q=*:*, I consistently get the right aggregated results back for all the documents across all the shards, for every request from the load balancer. Can someone please let me know what this is symptomatic of? Somehow SolrCloud seems to be doing search query distribution and aggregation only for queries of type *:*. Thanks.
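One way to see why the "1 in 3" symptom points at routing rather than at *:* queries: each id hashes to exactly one of the three shards, so a request that is not fanned out to all shards finds the document only when it happens to land on that shard. The toy router below is an illustration only; real SolrCloud uses a MurmurHash3-based compositeId router over hash ranges, not the md5-modulo stand-in here:

```python
import hashlib

def toy_shard_for(doc_id, num_shards=3):
    """Illustrative only: deterministically map a document id to one of
    num_shards buckets. SolrCloud's actual router differs, but the key
    property is the same: one id lives on exactly one shard."""
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return h % num_shards

doc = "9e78c064-919f-4ef3-b236-dc66351b4acf"
# Only non-distributed requests that land on this one shard can find the
# document; the other two shards return numFound=0.
print(f"doc lives on shard {toy_shard_for(doc)} of 3")
```

A correctly working distributed query fans out to all three shards and merges the results, which is why the failure described above (a result only 1 time in 3) suggests the query is sometimes not being distributed at all.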
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Erick, Thanks for your reply, I tried your suggestions. 1) When not using the load balancer, if I *have distrib=false* I get consistent results across the replicas. 2) However, here's the interesting part: while not using the load balancer, if I *don't have distrib=false*, then when I query a particular node I get the same behaviour as if I were using a load balancer, meaning the distributed search from a node works intermittently. Does this give any clue? On Thu, Oct 2, 2014 at 7:47 PM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, nothing quite makes sense here. Here are some experiments: 1) avoid the load balancer and issue queries like http://solr_server:8983/solr/collection/q=whatever&distrib=false; the distrib=false bit will keep SolrCloud from trying to send the queries anywhere: they'll be served only from the node you address them to. That'll help check whether the nodes are consistent. You should be getting back the same results from each replica in a shard (i.e. 2 of your 6 machines). Next, try your failing query the same way. Next, try your failing query from a browser, pointing it at successive nodes. Where is the first place problems show up? My _guess_ is that your load balancer isn't quite doing what you think, or your cluster isn't set up the way you think it is, but those are guesses. Best, Erick On Thu, Oct 2, 2014 at 2:51 PM, S.L simpleliving...@gmail.com wrote: Hi All, I am trying to query a 6 node Solr 4.7 cluster with 3 shards and a replication factor of 2. I have fronted these 6 Solr nodes with a load balancer. What I notice is that every time I do a search of the form q=*:*&fq=(id:9e78c064-919f-4ef3-b236-dc66351b4acf), it gives me a result only once in every 3 tries, telling me that the load balancer is distributing the requests between the 3 shards and SolrCloud only returns a result if the request goes to the core that has that id.
However, if I do a simple search like q=*:*, I consistently get the right aggregated results back for all the documents across all the shards, for every request from the load balancer. Can someone please let me know what this is symptomatic of? Somehow SolrCloud seems to be doing search query distribution and aggregation only for queries of type *:*. Thanks.
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Erick, I would like to add that the interesting behavior, i.e. point #2 that I mentioned in my earlier reply, happens in all the shards. If this were a distributed search issue it should not have manifested itself in the shard that contains the key I am searching for; it looks like the search is just failing as a whole, intermittently. Also, the collection is being actively indexed as I query this; could that be an issue too? Thanks. On Thu, Oct 2, 2014 at 10:24 PM, S.L simpleliving...@gmail.com wrote: Erick, Thanks for your reply, I tried your suggestions. 1) When not using the load balancer, if I *have distrib=false* I get consistent results across the replicas. 2) However, here's the interesting part: while not using the load balancer, if I *don't have distrib=false*, then when I query a particular node I get the same behaviour as if I were using a load balancer, meaning the distributed search from a node works intermittently. Does this give any clue? On Thu, Oct 2, 2014 at 7:47 PM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, nothing quite makes sense here. Here are some experiments: 1) avoid the load balancer and issue queries like http://solr_server:8983/solr/collection/q=whatever&distrib=false; the distrib=false bit will keep SolrCloud from trying to send the queries anywhere: they'll be served only from the node you address them to. That'll help check whether the nodes are consistent. You should be getting back the same results from each replica in a shard (i.e. 2 of your 6 machines). Next, try your failing query the same way. Next, try your failing query from a browser, pointing it at successive nodes. Where is the first place problems show up? My _guess_ is that your load balancer isn't quite doing what you think, or your cluster isn't set up the way you think it is, but those are guesses.
Best, Erick On Thu, Oct 2, 2014 at 2:51 PM, S.L simpleliving...@gmail.com wrote: Hi All, I am trying to query a 6 node Solr 4.7 cluster with 3 shards and a replication factor of 2. I have fronted these 6 Solr nodes with a load balancer. What I notice is that every time I do a search of the form q=*:*&fq=(id:9e78c064-919f-4ef3-b236-dc66351b4acf), it gives me a result only once in every 3 tries, telling me that the load balancer is distributing the requests between the 3 shards and SolrCloud only returns a result if the request goes to the core that has that id. However, if I do a simple search like q=*:*, I consistently get the right aggregated results back for all the documents across all the shards, for every request from the load balancer. Can someone please let me know what this is symptomatic of? Somehow SolrCloud seems to be doing search query distribution and aggregation only for queries of type *:*. Thanks.
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Erick, 0) The load balancer is out of the picture. 1) When I query with *distrib=false*, I get consistent results as expected for those shards that don't have the key, i.e. I don't get results back for those shards. However, I just realized that while *distrib=false* is present in the query for the shard that is supposed to contain the key, only the replica of the shard that has this key returns the result, and the leader does not. It looks like the replica and the leader do not have the same data, and the replica seems to contain the key in the query for that shard. 2) By indexing I mean this collection is being populated by a web crawler. So it looks like 1) above points to the leader and replica being out of sync for at least one shard. On Thu, Oct 2, 2014 at 11:57 PM, Erick Erickson erickerick...@gmail.com wrote: bq: Also, the collection is being actively indexed as I query this, could that be an issue too? Not if the documents you're searching for aren't being added as you search (and all your autocommit intervals have expired). I would turn off indexing for testing; it's just one more variable that can get in the way of understanding this. Do note that if the problem were endemic to Solr, there would probably be a _lot_ more noise out there. So to recap: 0) we can take the load balancer out of the picture altogether. 1) when you query each shard individually with distrib=false, every replica in a particular shard returns the same count. 2) when you query without distrib=false you get varying counts. This is very strange and not at all expected. Let's try it again without indexing going on... And what do you mean by indexing anyway? How are documents being fed to your system?
Best, Erick@PuzzledAsWell On Thu, Oct 2, 2014 at 7:32 PM, S.L simpleliving...@gmail.com wrote: Erick, I would like to add that the interesting behavior, i.e. point #2 that I mentioned in my earlier reply, happens in all the shards. If this were a distributed search issue it should not have manifested itself in the shard that contains the key I am searching for; it looks like the search is just failing as a whole, intermittently. Also, the collection is being actively indexed as I query this; could that be an issue too? Thanks. On Thu, Oct 2, 2014 at 10:24 PM, S.L simpleliving...@gmail.com wrote: Erick, Thanks for your reply, I tried your suggestions. 1) When not using the load balancer, if I *have distrib=false* I get consistent results across the replicas. 2) However, here's the interesting part: while not using the load balancer, if I *don't have distrib=false*, then when I query a particular node I get the same behaviour as if I were using a load balancer, meaning the distributed search from a node works intermittently. Does this give any clue? On Thu, Oct 2, 2014 at 7:47 PM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, nothing quite makes sense here. Here are some experiments: 1) avoid the load balancer and issue queries like http://solr_server:8983/solr/collection/q=whatever&distrib=false; the distrib=false bit will keep SolrCloud from trying to send the queries anywhere: they'll be served only from the node you address them to. That'll help check whether the nodes are consistent. You should be getting back the same results from each replica in a shard (i.e. 2 of your 6 machines). Next, try your failing query the same way. Next, try your failing query from a browser, pointing it at successive nodes. Where is the first place problems show up? My _guess_ is that your load balancer isn't quite doing what you think, or your cluster isn't set up the way you think it is, but those are guesses.
Best, Erick On Thu, Oct 2, 2014 at 2:51 PM, S.L simpleliving...@gmail.com wrote: Hi All, I am trying to query a 6 node Solr 4.7 cluster with 3 shards and a replication factor of 2. I have fronted these 6 Solr nodes with a load balancer. What I notice is that every time I do a search of the form q=*:*&fq=(id:9e78c064-919f-4ef3-b236-dc66351b4acf), it gives me a result only once in every 3 tries, telling me that the load balancer is distributing the requests between the 3 shards and SolrCloud only returns a result if the request goes to the core that has that id. However, if I do a simple search like q=*:*, I consistently get the right aggregated results back for all the documents across all the shards, for every request from the load balancer. Can someone please let me know what this is symptomatic of? Somehow SolrCloud seems to be doing search query distribution and aggregation only for queries of type *:*. Thanks.
pySolr and other Python client options for SolrCloud.
Hi All, We recently moved from a single Solr instance to SolrCloud, and we are using pysolr. I am wondering what options (clients) we have in Python to take advantage of the Zookeeper and load-balancing capabilities that SolrCloud provides, the way I could if I were using a smart client like SolrJ? Thanks.
Re: pySolr and other Python client options for SolrCloud.
Right, but my question was whether there are any Python clients that achieve the same thing as SolrJ, or what approach one should take when using Python-based clients. On Wed, Oct 1, 2014 at 3:57 PM, Upayavira u...@odoko.co.uk wrote: On Wed, Oct 1, 2014, at 08:47 PM, S.L wrote: Hi All, We recently moved from a single Solr instance to SolrCloud, and we are using pysolr. I am wondering what options (clients) we have in Python to take advantage of the Zookeeper and load-balancing capabilities that SolrCloud provides, the way I could if I were using a smart client like SolrJ? Obviously SolrJ is Java, not Python. SolrJ has integration with Zookeeper, so when you instantiate a CloudSolrServer instance, you tell it where Zookeeper is, not Solr. Your app then consults Zookeeper to find out which Solr instance to talk to. This means you can move stuff around within your infrastructure without needing to tell your app, and without needing to mess with load balancers, as that is all handled for you by the SolrJ client deciding which node to forward your request to. Upayavira
Re: pySolr and other Python client options for SolrCloud.
Shawn, Thanks; a load balancer seems to be the preferred solution here. I have a topology where I have 6 Solr nodes that support 3 shards with a replication factor of 2. It looks like it would be better to use the load balancer for querying only. The question that I have is: if I go the load balancer route, should I be listing all six nodes in the load balancer, or only the leaders as identified by the SolrCloud admin console? Would the load balancing solution also incur additional routing of requests between the individual nodes of SolrCloud that would not have happened had the Python Solr client been Zookeeper aware? Also, for indexing, which is not done from a Python client but is done using SolrJ, I will avoid the load balancers and do the indexing via the Zookeeper route. Thanks. On Wed, Oct 1, 2014 at 8:42 PM, Shawn Heisey apa...@elyograg.org wrote: On 10/1/2014 2:29 PM, S.L wrote: Right, but my question was whether there are any Python clients that achieve the same thing as SolrJ, or what approach one should take when using Python-based clients. If the python client can support multiple hosts and failing over between them, then you would simply list multiple URLs. If not, then you'll need a load balancer. I use haproxy with Solr (not in Cloud mode) for automatic failover, and it should work equally well for SolrCloud and a non-java client. It looks like Alexandre knows a lot more about it than I do ... I know very little about python. Thanks, Shawn
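The failover behaviour Shawn describes (try multiple hosts, fall back when one is down) can also be approximated client-side when no load balancer is available. A minimal Python sketch; the wrapper class, node URLs, and fake backend below are hypothetical, not part of pysolr or haproxy:

```python
class FailoverClient:
    """Try each node URL in order until one succeeds.
    A stand-in for a real load balancer; in practice `search_fn`
    would wrap a pysolr (or plain HTTP) call to the given node."""

    def __init__(self, node_urls, search_fn):
        self.node_urls = list(node_urls)
        self.search_fn = search_fn

    def search(self, query):
        last_err = None
        for url in self.node_urls:
            try:
                return self.search_fn(url, query)
            except ConnectionError as e:
                last_err = e  # node unreachable: try the next one
        raise last_err  # every node failed


# Demo with a fake backend where the first node is down.
def fake_search(url, query):
    if url == "http://solr1:8983/solr":
        raise ConnectionError("node down")
    return {"node": url, "q": query, "numFound": 1}

client = FailoverClient(
    ["http://solr1:8983/solr", "http://solr2:8983/solr"], fake_search)
print(client.search("*:*")["node"])  # http://solr2:8983/solr
```

This only gives failover, not load balancing; it always prefers the first healthy node, which is one reason a real load balancer (or the Zookeeper-aware SolrJ client) is usually the better choice.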
Re: pySolr and other Python client options for SolrCloud.
That makes perfect sense, thanks again! On Wed, Oct 1, 2014 at 10:09 PM, Shawn Heisey apa...@elyograg.org wrote: On 10/1/2014 7:08 PM, S.L wrote: Thanks, load balancer seems to be the preferred solution here. I have a topology where I have 6 Solr nodes that support 3 shards with a replication factor of 2. Looks like it would be better to use the load balancers for querying only. The question that I have is, if I go the load balancer route, should I be listing all six nodes in the load balancer or only the leaders as identified by the SolrCloud admin console? Would the load balancing solution also incur any additional routing of requests between the individual nodes of SolrCloud that would not have happened had the Python Solr client been Zookeeper aware? Also for indexing, which is not done from a Python client but is done using SolrJ, I will avoid the load balancers and do the indexing via the Zookeeper route. If you were to send all your queries to just one server, it's my understanding that SolrCloud will load balance the actual work across the cloud. I have not verified this. For a load balancer, the minimum requirement would be to list two of the servers, but it's probably better to list them all. Leader designations can change, and I'm pretty sure you don't want to change your load balancer config just because the leader changed. If your 3 shards are using automatic document routing, then you can send updates to any machine in the cluster and they'll end up in the right place. Since you're using SolrJ for updates, this is probably not something you need to worry about. Thanks, Shawn
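Shawn's haproxy suggestion maps to a configuration along these lines. This is only a hedged sketch with hypothetical hostnames and ports; it lists all six nodes rather than just the current leaders, so that a leader change never requires a config edit:

```
frontend solr_in
    bind *:8983
    default_backend solr_nodes

backend solr_nodes
    balance roundrobin
    option httpchk GET /solr/admin/ping
    server solr1 solr1.example.com:8983 check
    server solr2 solr2.example.com:8983 check
    server solr3 solr3.example.com:8983 check
    server solr4 solr4.example.com:8983 check
    server solr5 solr5.example.com:8983 check
    server solr6 solr6.example.com:8983 check
```

The httpchk line makes haproxy remove a node from rotation when its ping handler stops responding, which is what gives a non-ZooKeeper-aware client like pysolr its failover.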
Intermittent error indexing SolrCloud 4.7.0
Hi All, I get a No live SolrServers available to handle this request error intermittently while indexing in a SolrCloud cluster with 3 shards and a replication factor of 2. I am using Solr 4.7.0. Please see the stack trace below.

org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request
    at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:352) ~[DynaOCrawlerUtils.jar:?]
    at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:640) ~[DynaOCrawlerUtils.jar:?]
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117) ~[DynaOCrawlerUtils.jar:?]
    at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:168) ~[DynaOCrawlerUtils.jar:?]
    at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:146) ~[DynaOCrawlerUtils.jar:?]
Crawl-Delay in robots.txt and fetcher.threads.per.queue property in Nutch
Hello All, If I set the fetcher.threads.per.queue property to more than 1, I believe the behavior would be to have that many threads per host in Nutch. In that case, would Nutch still respect the Crawl-Delay directive in robots.txt and not crawl at a faster pace than what is specified in robots.txt? In short, what I am trying to ask is whether setting fetcher.threads.per.queue to 1 is required for being as polite as the Crawl-Delay in robots.txt expects. Thx
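If politeness is the priority, the conservative setting the question alludes to can be pinned in nutch-site.xml. A sketch; fetcher.threads.per.queue is a real Nutch property whose default is 1, which guarantees at most one concurrent fetch per host, so any Crawl-Delay is trivially honored:

```xml
<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
  <description>Number of fetch threads allowed per queue (i.e. per host).</description>
</property>
```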
Spell checker - limit on number of misspelt words in a search term.
Hi All, I am using the Direct spell checker component and I have collate=true in my solrconfig.xml. The issue that I noticed is that when I have a search term with up to two words in it and both of them are misspelled, I get a collation query as a suggestion in the spellchecker output. If I increase the search term length to 3 words and spell all of them incorrectly, then I do not get a collation query as an output in the spell checker suggestions. Is there a setting in the solrconfig.xml file that's controlling this behavior by restricting the search term to be up to two misspelt words to suggest a collation query? If so, I would need to change that property. Can anyone please let me know how to do so? Thanks. Sent from my mobile.
Re: Is it possible for solr to calculate and give back the price of a product based on its sub-products
I am not sure if that is doable; I think it needs to be taken care of at indexing time. On Sun, Jun 8, 2014 at 4:55 PM, Gharbi Mohamed gharbi.mohamed.e...@gmail.com wrote: Hi, I am using Solr for searching Magento products in my project. I want to know: is it possible for Solr to calculate and give back the price of a product based on its sub-products (items)? For instance, I have a product P1 and it is the parent of items m1, m2. I need to get the minimal price of the items and return it as the price of product P1. I'm wondering if that is possible? I need to know if Solr can do that, or if there is a feature or a way to do it. And finally, I thank you! Regards, Mohamed.
Re: Strange Behavior with Solr in Tomcat.
Thanks, Meraj, that was exactly the issue; setting <useColdSearcher>true</useColdSearcher> worked like a charm and the server starts up as usual. Thanks again! On Fri, Jun 6, 2014 at 2:42 PM, Meraj A. Khan mera...@gmail.com wrote: This looks distinctly related to https://issues.apache.org/jira/browse/SOLR-4408 , try coldSearcher = true as being suggested in the JIRA and let us know. On Fri, Jun 6, 2014 at 2:39 PM, Jean-Sebastien Vachon jean-sebastien.vac...@wantedanalytics.com wrote: I would try a thread dump and check the output to see what's going on. You could also strace the process if you're running on Unix, or change the log level in Solr to get more information logged. -Original Message- From: S.L [mailto:simpleliving...@gmail.com] Sent: June-06-14 2:33 PM To: solr-user@lucene.apache.org Subject: Re: Strange Behavior with Solr in Tomcat. Anyone folks? On Wed, Jun 4, 2014 at 10:25 AM, S.L simpleliving...@gmail.com wrote: Hi Folks, I recently started using the spellchecker in my solrconfig.xml. I am able to build up an index in Solr. But if I ever shut down Tomcat, I am not able to restart it. The server never prints the server startup time in seconds in the logs, nor does it print any error messages in the catalina.out file. The only way for me to get around this is by deleting the data directory of the index and then starting the server; obviously this makes me lose my index. Just wondering if anyone faced a similar issue and if they were able to solve it. Thanks. - No virus found in this message. Checked by AVG - www.avg.fr Version: 2014.0.4570 / Virus database: 3950/7571 - Date: 27/05/2014 The virus database has expired.
Re: Strange Behavior with Solr in Tomcat.
Anyone folks? On Wed, Jun 4, 2014 at 10:25 AM, S.L simpleliving...@gmail.com wrote: Hi Folks, I recently started using the spellchecker in my solrconfig.xml. I am able to build up an index in Solr. But if I ever shut down Tomcat, I am not able to restart it. The server never prints the server startup time in seconds in the logs, nor does it print any error messages in the catalina.out file. The only way for me to get around this is by deleting the data directory of the index and then starting the server; obviously this makes me lose my index. Just wondering if anyone faced a similar issue and if they were able to solve it. Thanks.
Strange Behavior with Solr in Tomcat.
Hi Folks, I recently started using the spellchecker in my solrconfig.xml. I am able to build up an index in Solr. But if I ever shut down Tomcat, I am not able to restart it. The server never prints the server startup time in seconds in the logs, nor does it print any error messages in the catalina.out file. The only way for me to get around this is by deleting the data directory of the index and then starting the server; obviously this makes me lose my index. Just wondering if anyone faced a similar issue and if they were able to solve it. Thanks.
Re: Strange Behavior with Solr in Tomcat.
Hi, This is not a case of accidental deletion; the only way I can restart Tomcat is by deleting the data directory for the index that was created earlier. This started happening after I started using spellcheckers in my solrconfig.xml. As long as Tomcat is running, it's fine. Any help from anyone who faced a similar issue would be appreciated. Thanks. On Wed, Jun 4, 2014 at 11:08 AM, Aman Tandon antn.s...@gmail.com wrote: I guess if you try to copy the index and then kill the process of tomcat then it might help. If the index still needs to be deleted, you would have the backup. Next time always make a backup. On Jun 4, 2014 7:55 PM, S.L simpleliving...@gmail.com wrote: Hi Folks, I recently started using the spellchecker in my solrconfig.xml. I am able to build up an index in Solr. But if I ever shut down Tomcat, I am not able to restart it. The server never prints the server startup time in seconds in the logs, nor does it print any error messages in the catalina.out file. The only way for me to get around this is by deleting the data directory of the index and then starting the server; obviously this makes me lose my index. Just wondering if anyone faced a similar issue and if they were able to solve it. Thanks.
Re: DirectSpellChecker not returning expected suggestions.
Anyone? On Sat, May 31, 2014 at 12:33 AM, S.L simpleliving...@gmail.com wrote: Hi All, I have a small test index of 400 documents; it happens to have an entry for wrangler. When I search for wranglr, I correctly get the collation suggestion wrangler; however, when I search for wrangle, I do not get a suggestion for wrangler. The Levenshtein distance between wrangle -> wrangler is the same as the Levenshtein distance between wranglr -> wrangler, so I am just wondering why I do not get a suggestion for wrangle. Below is my DirectSolrSpellChecker configuration.

<lst name="spellchecker">
  <str name="name">direct</str>
  <str name="field">suggestAggregate</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <!-- the spellcheck distance measure used, the default is the internal levenshtein -->
  <str name="distanceMeasure">internal</str>
  <str name="comparatorClass">score</str>
  <!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
  <float name="accuracy">0.7</float>
  <!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
  <int name="maxEdits">1</int>
  <!-- the minimum shared prefix when enumerating terms -->
  <int name="minPrefix">3</int>
  <!-- maximum number of inspections per result -->
  <int name="maxInspections">5</int>
  <!-- minimum length of a query term to be considered for correction -->
  <int name="minQueryLength">4</int>
  <!-- maximum threshold of documents a query term can appear in to be considered for correction -->
  <float name="maxQueryFrequency">0.01</float>
  <!-- uncomment this to require suggestions to occur in 1% of the documents -->
  <!-- <float name="thresholdTokenFrequency">.01</float> -->
</lst>
Re: DirectSpellChecker not returning expected suggestions.
I do not get any suggestion (when I search for wrangle); however, I correctly get the suggestion wrangler when I search for wranglr. I am using the Direct and WordBreak spellcheckers in combination; I have not tried using anything else. Is the distance calculation in Solr different from the Levenshtein distance calculation? I have set maxEdits to 1, assuming that this corresponds to the maximum distance. Thanks for your help! On Mon, Jun 2, 2014 at 1:54 PM, david.w.smi...@gmail.com wrote: What do you get then? Suggestions, but not the one you're looking for, or is it deemed correctly spelled? Have you tried another spellChecker impl, for troubleshooting purposes? ~ David Smiley Freelance Apache Lucene/Solr Search Consultant/Developer http://www.linkedin.com/in/davidwsmiley On Sat, May 31, 2014 at 12:33 AM, S.L simpleliving...@gmail.com wrote: Hi All, I have a small test index of 400 documents; it happens to have an entry for wrangler. When I search for wranglr, I correctly get the collation suggestion wrangler; however, when I search for wrangle, I do not get a suggestion for wrangler. The Levenshtein distance between wrangle -> wrangler is the same as the Levenshtein distance between wranglr -> wrangler, so I am just wondering why I do not get a suggestion for wrangle. Below is my DirectSolrSpellChecker configuration.

<lst name="spellchecker">
  <str name="name">direct</str>
  <str name="field">suggestAggregate</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <!-- the spellcheck distance measure used, the default is the internal levenshtein -->
  <str name="distanceMeasure">internal</str>
  <str name="comparatorClass">score</str>
  <!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
  <float name="accuracy">0.7</float>
  <!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
  <int name="maxEdits">1</int>
  <!-- the minimum shared prefix when enumerating terms -->
  <int name="minPrefix">3</int>
  <!-- maximum number of inspections per result -->
  <int name="maxInspections">5</int>
  <!-- minimum length of a query term to be considered for correction -->
  <int name="minQueryLength">4</int>
  <!-- maximum threshold of documents a query term can appear in to be considered for correction -->
  <float name="maxQueryFrequency">0.01</float>
  <!-- uncomment this to require suggestions to occur in 1% of the documents -->
  <!-- <float name="thresholdTokenFrequency">.01</float> -->
</lst>
Re: DirectSpellChecker not returning expected suggestions.
OK, I just realized that wrangle is a proper English word; probably that's why I don't get a suggestion for wrangler in this case. However, in my test index there is no wrangle present, so even though this is a proper English word, since there is no occurrence of it in the index, shouldn't Solr suggest wrangler? On Mon, Jun 2, 2014 at 2:00 PM, S.L simpleliving...@gmail.com wrote: I do not get any suggestion (when I search for wrangle); however, I correctly get the suggestion wrangler when I search for wranglr. I am using the Direct and WordBreak spellcheckers in combination; I have not tried using anything else. Is the distance calculation in Solr different from the Levenshtein distance calculation? I have set maxEdits to 1, assuming that this corresponds to the maximum distance. Thanks for your help! On Mon, Jun 2, 2014 at 1:54 PM, david.w.smi...@gmail.com wrote: What do you get then? Suggestions, but not the one you're looking for, or is it deemed correctly spelled? Have you tried another spellChecker impl, for troubleshooting purposes? ~ David Smiley Freelance Apache Lucene/Solr Search Consultant/Developer http://www.linkedin.com/in/davidwsmiley On Sat, May 31, 2014 at 12:33 AM, S.L simpleliving...@gmail.com wrote: Hi All, I have a small test index of 400 documents; it happens to have an entry for wrangler. When I search for wranglr, I correctly get the collation suggestion wrangler; however, when I search for wrangle, I do not get a suggestion for wrangler. The Levenshtein distance between wrangle -> wrangler is the same as the Levenshtein distance between wranglr -> wrangler, so I am just wondering why I do not get a suggestion for wrangle. Below is my DirectSolrSpellChecker configuration.

<lst name="spellchecker">
  <str name="name">direct</str>
  <str name="field">suggestAggregate</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <!-- the spellcheck distance measure used, the default is the internal levenshtein -->
  <str name="distanceMeasure">internal</str>
  <str name="comparatorClass">score</str>
  <!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
  <float name="accuracy">0.7</float>
  <!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
  <int name="maxEdits">1</int>
  <!-- the minimum shared prefix when enumerating terms -->
  <int name="minPrefix">3</int>
  <!-- maximum number of inspections per result -->
  <int name="maxInspections">5</int>
  <!-- minimum length of a query term to be considered for correction -->
  <int name="minQueryLength">4</int>
  <!-- maximum threshold of documents a query term can appear in to be considered for correction -->
  <float name="maxQueryFrequency">0.01</float>
  <!-- uncomment this to require suggestions to occur in 1% of the documents -->
  <!-- <float name="thresholdTokenFrequency">.01</float> -->
</lst>
Re: DirectSpellChecker not returning expected suggestions.
Thanks. You mean wrangler has been stemmed to wrangle? If that's the case, then why does it not return any results for wrangle? On Mon, Jun 2, 2014 at 2:07 PM, david.w.smi...@gmail.com wrote: It appears to be stemmed. ~ David Smiley Freelance Apache Lucene/Solr Search Consultant/Developer http://www.linkedin.com/in/davidwsmiley On Mon, Jun 2, 2014 at 2:06 PM, S.L simpleliving...@gmail.com wrote: OK, I just realized that wrangle is a proper English word; probably that's why I don't get a suggestion for wrangler in this case. However, in my test index there is no wrangle present, so even though this is a proper English word, since there is no occurrence of it in the index, shouldn't Solr suggest wrangler? On Mon, Jun 2, 2014 at 2:00 PM, S.L simpleliving...@gmail.com wrote: I do not get any suggestion (when I search for wrangle); however, I correctly get the suggestion wrangler when I search for wranglr. I am using the Direct and WordBreak spellcheckers in combination; I have not tried using anything else. Is the distance calculation in Solr different from the Levenshtein distance calculation? I have set maxEdits to 1, assuming that this corresponds to the maximum distance. Thanks for your help! On Mon, Jun 2, 2014 at 1:54 PM, david.w.smi...@gmail.com wrote: What do you get then? Suggestions, but not the one you're looking for, or is it deemed correctly spelled? Have you tried another spellChecker impl, for troubleshooting purposes? ~ David Smiley Freelance Apache Lucene/Solr Search Consultant/Developer http://www.linkedin.com/in/davidwsmiley On Sat, May 31, 2014 at 12:33 AM, S.L simpleliving...@gmail.com wrote: Hi All, I have a small test index of 400 documents; it happens to have an entry for wrangler. When I search for wranglr, I correctly get the collation suggestion wrangler; however, when I search for wrangle, I do not get a suggestion for wrangler. The Levenshtein distance between wrangle -> wrangler is the same as the Levenshtein distance between wranglr -> wrangler, so I am just wondering why I do not get a suggestion for wrangle. Below is my DirectSolrSpellChecker configuration.

<lst name="spellchecker">
  <str name="name">direct</str>
  <str name="field">suggestAggregate</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <!-- the spellcheck distance measure used, the default is the internal levenshtein -->
  <str name="distanceMeasure">internal</str>
  <str name="comparatorClass">score</str>
  <!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
  <float name="accuracy">0.7</float>
  <!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
  <int name="maxEdits">1</int>
  <!-- the minimum shared prefix when enumerating terms -->
  <int name="minPrefix">3</int>
  <!-- maximum number of inspections per result -->
  <int name="maxInspections">5</int>
  <!-- minimum length of a query term to be considered for correction -->
  <int name="minQueryLength">4</int>
  <!-- maximum threshold of documents a query term can appear in to be considered for correction -->
  <float name="maxQueryFrequency">0.01</float>
  <!-- uncomment this to require suggestions to occur in 1% of the documents -->
  <!-- <float name="thresholdTokenFrequency">.01</float> -->
</lst>
Re: DirectSpellChecker not returning expected suggestions.
James, I get no results back and no suggestions for wrangle; however, I get suggestions for wranglr, and wrangle is not present in my index. I am just searching for wrangle in a field that is created by copying other fields; as to how it is analyzed, I don't have access to it right now. Thanks. On Mon, Jun 2, 2014 at 2:48 PM, Dyer, James james.d...@ingramcontent.com wrote: If wrangle is not in your index, and if it is within the max # of edits, then it should suggest it. Are you getting anything back from spellcheck at all? What is the exact query you are using? How is the spellcheck field analyzed? If you're using stemming, then wrangle and wrangler might be stemmed to the same word. (By the way, you shouldn't spellcheck against a stemmed or otherwise heavily-analyzed field.) James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: S.L [mailto:simpleliving...@gmail.com] Sent: Monday, June 02, 2014 1:06 PM To: solr-user@lucene.apache.org Subject: Re: DirectSpellChecker not returning expected suggestions. OK, I just realized that wrangle is a proper English word; probably that's why I don't get a suggestion for wrangler in this case. However, in my test index there is no wrangle present, so even though this is a proper English word, since there is no occurrence of it in the index, shouldn't Solr suggest wrangler? On Mon, Jun 2, 2014 at 2:00 PM, S.L simpleliving...@gmail.com wrote: I do not get any suggestion (when I search for wrangle); however, I correctly get the suggestion wrangler when I search for wranglr. I am using the Direct and WordBreak spellcheckers in combination; I have not tried using anything else. Is the distance calculation in Solr different from the Levenshtein distance calculation? I have set maxEdits to 1, assuming that this corresponds to the maximum distance. Thanks for your help! On Mon, Jun 2, 2014 at 1:54 PM, david.w.smi...@gmail.com wrote: What do you get then? Suggestions, but not the one you're looking for, or is it deemed correctly spelled? Have you tried another spellChecker impl, for troubleshooting purposes? ~ David Smiley Freelance Apache Lucene/Solr Search Consultant/Developer http://www.linkedin.com/in/davidwsmiley On Sat, May 31, 2014 at 12:33 AM, S.L simpleliving...@gmail.com wrote: Hi All, I have a small test index of 400 documents; it happens to have an entry for wrangler. When I search for wranglr, I correctly get the collation suggestion wrangler; however, when I search for wrangle, I do not get a suggestion for wrangler. The Levenshtein distance between wrangle -> wrangler is the same as the Levenshtein distance between wranglr -> wrangler, so I am just wondering why I do not get a suggestion for wrangle. Below is my DirectSolrSpellChecker configuration.

<lst name="spellchecker">
  <str name="name">direct</str>
  <str name="field">suggestAggregate</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <!-- the spellcheck distance measure used, the default is the internal levenshtein -->
  <str name="distanceMeasure">internal</str>
  <str name="comparatorClass">score</str>
  <!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
  <float name="accuracy">0.7</float>
  <!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
  <int name="maxEdits">1</int>
  <!-- the minimum shared prefix when enumerating terms -->
  <int name="minPrefix">3</int>
  <!-- maximum number of inspections per result -->
  <int name="maxInspections">5</int>
  <!-- minimum length of a query term to be considered for correction -->
  <int name="minQueryLength">4</int>
  <!-- maximum threshold of documents a query term can appear in to be considered for correction -->
  <float name="maxQueryFrequency">0.01</float>
  <!-- uncomment this to require suggestions to occur in 1% of the documents -->
  <!-- <float name="thresholdTokenFrequency">.01</float> -->
</lst>
Re: Wordbreak spellchecker excessive breaking.
<!-- Result Window Size: An optimization for use with the queryResultCache. When a search is requested, a superset of the requested number of document ids are collected. For example, if a search for a particular query requests matching documents 10 through 19, and queryWindowSize is 50, then documents 0 through 49 will be collected and cached. Any further requests in that range can be satisfied via the cache. -->
<queryResultWindowSize>20</queryResultWindowSize>
<!-- Maximum number of documents to cache for any entry in the queryResultCache. -->
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>
<!-- Query Related Event Listeners: Various IndexSearcher related events can trigger Listeners to take actions. newSearcher - fired whenever a new searcher is being prepared and there is a current searcher handling requests (aka registered). It can be used to prime certain caches to prevent long request times for certain requests. firstSearcher - fired whenever a new searcher is being prepared but there is no current registered searcher to handle requests or to gain autowarming data from. -->
<!-- QuerySenderListener takes an array of NamedList and executes a local query request for each NamedList in sequence. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!--
    <lst><str name="q">solr</str><str name="sort">price asc</str></lst>
    <lst><str name="q">rocks</str><str name="sort">weight asc</str></lst>
    -->
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">static firstSearcher warming in solrconfig.xml</str>
    </lst>
  </arr>
</listener>
<!-- Use Cold Searcher: If a search request comes in and there is no current registered searcher, then immediately register the still warming searcher and use it. If false then all requests will block until the first searcher is done warming. -->
<useColdSearcher>false</useColdSearcher>
<!-- Max Warming Searchers: Maximum number of searchers that may be warming in the background concurrently. An error is returned if this limit is exceeded. Recommend values of 1-2 for read-only slaves, higher for masters w/o cache warming. -->
<maxWarmingSearchers>2</maxWarmingSearchers>
</query>

On Fri, May 30, 2014 at 10:20 AM, Dyer, James james.d...@ingramcontent.com wrote: I am not sure why changing spellcheck parameters would prevent your server from restarting. One thing to check is to see if you have warming queries running that involve spellcheck. I think I remember from long ago there was (maybe still is) an obscure bug where sometimes it will lock up in rare cases when spellcheck is used in warming queries. I do not remember exactly what caused this or if it was ever fixed. Besides that, you might want to post a stack trace or describe what happens when it doesn't restart. Perhaps someone here will know what the problem is. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: S.L [mailto:simpleliving...@gmail.com] Sent: Friday, May 30, 2014 12:36 AM To: solr-user@lucene.apache.org Subject: Re: Wordbreak spellchecker excessive breaking. James, Thanks for clearly stating this; I was not able to find this documented anywhere. Yes, I am using it with another spell checker (Direct) with collation on. I will try maxChanges and let you know. On a side note, whenever I change the spellchecker parameters, I need to rebuild the index and delete the Solr data directory before that, as my Tomcat instance would not even start; can you let me know why? Thanks. On Tue, May 27, 2014 at 12:21 PM, Dyer, James james.d...@ingramcontent.com wrote: You can do this if you set it up like in the main Solr example:

<lst name="spellchecker">
  <str name="name">wordbreak</str>
  <str name="classname">solr.WordBreakSolrSpellChecker</str>
  <str name="field">name</str>
  <str name="combineWords">true</str>
  <str name="breakWords">true</str>
  <int name="maxChanges">10</int>
</lst>

The combineWords and breakWords flags let you tell it which kind of wordbreak correction you want. maxChanges controls the maximum number of words it can break 1 word into, or the maximum number of words it can combine. It is reasonable to set this to 1 or 2. The best way to use this is in conjunction with a regular spellchecker like DirectSolrSpellChecker. When used together with the collation functionality, it should take a query like mob ile and, depending on what actually returns results from your data, suggest either mobile or perhaps mob lie or both. The one thing it cannot do is fix a transposition or misspelling and combine or break words in one shot. That is, it cannot detect that mob lie should become mobile.
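James's advice (pairing WordBreakSolrSpellChecker with a regular spellchecker and enabling collation) is wired up in solrconfig.xml roughly as follows. A hedged sketch, assuming a spellcheck search component containing dictionaries named direct and wordbreak as in the configs quoted in this thread; repeating the spellcheck.dictionary parameter is how Solr consults multiple dictionaries together:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <!-- both dictionaries are consulted and their suggestions combined -->
    <str name="spellcheck.dictionary">direct</str>
    <str name="spellcheck.dictionary">wordbreak</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

With this wiring, a query like mob ile can yield a collation such as mobile when the combined dictionaries plus collation testing find a variant that actually returns results.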
DirectSpellChecker not returning expected suggestions.
Hi All, I have a small test index of 400 documents; it happens to have an entry for wrangler. When I search for wranglr, I correctly get the collation suggestion wrangler; however, when I search for wrangle, I do not get a suggestion for wrangler. The Levenshtein distance between wrangle -> wrangler is the same as the Levenshtein distance between wranglr -> wrangler, so I am just wondering why I do not get a suggestion for wrangle. Below is my DirectSolrSpellChecker configuration.

<lst name="spellchecker">
  <str name="name">direct</str>
  <str name="field">suggestAggregate</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <!-- the spellcheck distance measure used, the default is the internal levenshtein -->
  <str name="distanceMeasure">internal</str>
  <str name="comparatorClass">score</str>
  <!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
  <float name="accuracy">0.7</float>
  <!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
  <int name="maxEdits">1</int>
  <!-- the minimum shared prefix when enumerating terms -->
  <int name="minPrefix">3</int>
  <!-- maximum number of inspections per result -->
  <int name="maxInspections">5</int>
  <!-- minimum length of a query term to be considered for correction -->
  <int name="minQueryLength">4</int>
  <!-- maximum threshold of documents a query term can appear in to be considered for correction -->
  <float name="maxQueryFrequency">0.01</float>
  <!-- uncomment this to require suggestions to occur in 1% of the documents -->
  <!-- <float name="thresholdTokenFrequency">.01</float> -->
</lst>
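The distance claim in the question can be checked independently of Solr. Below is a plain-Python sketch of the textbook dynamic-programming edit distance (not Solr's internal implementation); both misspellings are one edit away from wrangler, so both fall within maxEdits=1:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("wranglr", "wrangler"))  # 1 (insert 'e')
print(levenshtein("wrangle", "wrangler"))  # 1 (append 'r')
```

Since both distances are equal, the asymmetry in suggestions points at analysis (e.g. stemming) or frequency thresholds rather than the distance measure itself.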
Re: Wordbreak spellchecker excessive breaking.
James, Thanks for clearly stating this; I was not able to find this documented anywhere. Yes, I am using it with another spell checker (Direct) with collation on. I will try maxChanges and let you know. On a side note, whenever I change the spellchecker parameters, I need to rebuild the index and delete the Solr data directory before that, as my Tomcat instance would not even start; can you let me know why? Thanks. On Tue, May 27, 2014 at 12:21 PM, Dyer, James james.d...@ingramcontent.com wrote: You can do this if you set it up like in the main Solr example:

<lst name="spellchecker">
  <str name="name">wordbreak</str>
  <str name="classname">solr.WordBreakSolrSpellChecker</str>
  <str name="field">name</str>
  <str name="combineWords">true</str>
  <str name="breakWords">true</str>
  <int name="maxChanges">10</int>
</lst>

The combineWords and breakWords flags let you tell it which kind of wordbreak correction you want. maxChanges controls the maximum number of words it can break 1 word into, or the maximum number of words it can combine. It is reasonable to set this to 1 or 2. The best way to use this is in conjunction with a regular spellchecker like DirectSolrSpellChecker. When used together with the collation functionality, it should take a query like mob ile and, depending on what actually returns results from your data, suggest either mobile or perhaps mob lie or both. The one thing it cannot do is fix a transposition or misspelling and combine or break words in one shot. That is, it cannot detect that mob lie should become mobile. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: S.L [mailto:simpleliving...@gmail.com] Sent: Saturday, May 24, 2014 4:21 PM To: solr-user@lucene.apache.org Subject: Wordbreak spellchecker excessive breaking.
I am using the Solr wordbreak spellchecker, and the issue is that when I search for a term like mob ile, expecting that the wordbreak spellchecker would return a suggestion for mobile, it breaks the search term into letters like m o b. I have two issues with this behavior. 1. How can I make Solr combine mob ile to mobile? 2. Notwithstanding the fact that my search term mob ile is being broken incorrectly into individual letters, I realize that the wordbreak is needed in certain cases; how do I control the wordbreak so that it does not break it into letters like m o b, which seems like excessive breaking to me? Thanks.
Re: Wordbreak spellchecker excessive breaking.
Anyone? On Sat, May 24, 2014 at 5:21 PM, S.L simpleliving...@gmail.com wrote: I am using the Solr wordbreak spellchecker, and the issue is that when I search for a term like mob ile, expecting that the wordbreak spellchecker would return a suggestion for mobile, it breaks the search term into letters like m o b. I have two issues with this behavior. 1. How can I make Solr combine mob ile to mobile? 2. Notwithstanding the fact that my search term mob ile is being broken incorrectly into individual letters, I realize that the wordbreak is needed in certain cases; how do I control the wordbreak so that it does not break it into letters like m o b, which seems like excessive breaking to me? Thanks.
Wordbreak spellchecker excessive breaking.
I am using the Solr wordbreak spellchecker, and the issue is that when I search for a term like "mob ile", expecting that the wordbreak spellchecker would return a suggestion for "mobile", it breaks the search term into letters like "m o b". I have two issues with this behavior. 1. How can I make Solr combine "mob ile" into "mobile"? 2. Notwithstanding the fact that my search term "mob ile" is being broken incorrectly into individual letters, I realize that wordbreak is needed in certain cases; how do I control the wordbreak so that it does not break the term into letters like "m o b", which seems like excessive breaking to me? Thanks.
Apache Solr SpellChecker Integration with the default select request handler
Hello fellow Solr users, I am using the default select request handler to search a Solr core, and I also use the eDismax query parser. 1. I want to integrate this with the spellchecker search component so that if a search request comes in, the spellchecker component also gets called and I get a suggestion back with the search results. 2. If the suggestion is above a certain threshold then I want the search to be made on that suggestion; otherwise the suggestion should come back along with the search results for the original search term. To accomplish this it seems I need to extend the SearchHandler.java class to call the spellchecker internally and then make a search call if the spellchecker returns a suggestion that is above a certain threshold. I would really appreciate it if there are any examples of calling the SpellChecker component via the API in Solr that someone can share with me, and also if you could validate my approach. Thank You.
Re: Apache Solr SpellChecker Integration with the default select request handler
Yes, I use SolrJ, but only to index the data; the querying of the data happens using the default select query handler from a non-Java client. On Sat, Apr 12, 2014 at 12:12 PM, Furkan KAMACI furkankam...@gmail.com wrote: Hi; Do you use SolrJ in your application? Why did you not consider solving this with SolrJ? Thanks; Furkan KAMACI 2014-04-12 18:34 GMT+03:00 S.L simpleliving...@gmail.com: Hello fellow Solr users, I am using the default select request handler to search a Solr core, and I also use the eDismax query parser. 1. I want to integrate this with the spellchecker search component so that if a search request comes in, the spellchecker component also gets called and I get a suggestion back with the search results. 2. If the suggestion is above a certain threshold then I want the search to be made on that suggestion; otherwise the suggestion should come back along with the search results for the original search term. To accomplish this it seems I need to extend the SearchHandler.java class to call the spellchecker internally and then make a search call if the spellchecker returns a suggestion that is above a certain threshold. I would really appreciate it if there are any examples of calling the SpellChecker component via the API in Solr that someone can share with me, and also if you could validate my approach. Thank You.
Re: Apache Solr SpellChecker Integration with the default select request handler
Furkan, I am not sure how this could be a security concern; what I am actually asking for is an approach to integrate the spellchecker search component within the default request handler. Thanks. On Sat, Apr 12, 2014 at 5:38 PM, Furkan KAMACI furkankam...@gmail.com wrote: Hi; I do not want to change the direction of your question, but it is really good, secure and flexible to do such kinds of things at your client (a Java client or not). On the other hand, if you let people access your Solr instance directly it causes some security issues. Thanks; Furkan KAMACI 2014-04-12 19:26 GMT+03:00 S.L simpleliving...@gmail.com: Yes, I use SolrJ, but only to index the data; the querying of the data happens using the default select query handler from a non-Java client. On Sat, Apr 12, 2014 at 12:12 PM, Furkan KAMACI furkankam...@gmail.com wrote: Hi; Do you use SolrJ in your application? Why did you not consider solving this with SolrJ? Thanks; Furkan KAMACI 2014-04-12 18:34 GMT+03:00 S.L simpleliving...@gmail.com: Hello fellow Solr users, I am using the default select request handler to search a Solr core, and I also use the eDismax query parser. 1. I want to integrate this with the spellchecker search component so that if a search request comes in, the spellchecker component also gets called and I get a suggestion back with the search results. 2. If the suggestion is above a certain threshold then I want the search to be made on that suggestion; otherwise the suggestion should come back along with the search results for the original search term. To accomplish this it seems I need to extend the SearchHandler.java class to call the spellchecker internally and then make a search call if the spellchecker returns a suggestion that is above a certain threshold. I would really appreciate it if there are any examples of calling the SpellChecker component via the API in Solr that someone can share with me, and also if you could validate my approach. Thank You.
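Worth noting for this thread: Solr can attach the spellcheck component to the standard /select handler declaratively, without extending SearchHandler. A hedged solrconfig.xml sketch (it assumes a searchComponent named "spellcheck" is defined elsewhere in the file):

```xml
<!-- Sketch: wiring the spellcheck component into /select via
     last-components, so every query also returns suggestions. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

The second requirement in the original question (automatically re-searching on a high-confidence suggestion) is not covered by this config; that still needs client-side logic or a custom component.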
Combining eDismax and SpellChecker
Hi All, I want to suggest the correct phrase if a typo is made while searching and then search for it using the eDismax parser (pf, pf2, pf3); if no typo is made, then search using the eDismax parser alone. Is there a way I can combine these two components? I have seen examples for eDismax and also for SpellChecker, but nothing that combines the two. Can you please let me know? Thanks.
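As a sketch of the combination being asked about here, one common configuration is an eDismax request handler with the spellcheck component chained on. The handler name, field names, and boost values below are placeholders, not from this thread:

```xml
<!-- Sketch: eDismax phrase boosting plus "did you mean" suggestions.
     Field names (title, description) and boosts are illustrative. -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title^2 description</str>
    <str name="pf">title^10</str>  <!-- whole-phrase match, highest boost -->
    <str name="pf3">title^5</str>  <!-- 3-word shingle matches -->
    <str name="pf2">title^2</str>  <!-- 2-word shingle matches -->
    <str name="spellcheck">true</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

This returns suggestions alongside results in one round trip; deciding whether to re-query with a returned collation is still up to the client.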
Re: eDismax parser and the mm parameter
Ahmet, SpellChecker seems to be the exact thing that I need for fuzzy-type search; how can I combine SpellChecker with something like the edismax parser to make use of parameters like pf, pf2 and pf3? Is there any resource that you can point me to for that? Thanks. On Wed, Apr 2, 2014 at 9:12 PM, S.L simpleliving...@gmail.com wrote: Thanks Ahmet, I will definitely look into this. I appreciate that. On Wed, Apr 2, 2014 at 7:47 PM, Ahmet Arslan iori...@yahoo.com wrote: Yes, it has a spellcheck.collate parameter. I mean it has lots of parameters, and with the correct combination of parameters it can suggest White Siberian Ginseng from Whte Sberia Ginsng https://cwiki.apache.org/confluence/display/solr/Spell+Checking On Thursday, April 3, 2014 1:57 AM, simpleliving...@gmail.com simpleliving...@gmail.com wrote: Ahmet. Thanks, I will look into this option. Does the spellchecker support multiple-word search terms? Sent from my HTC - Reply message - From: Ahmet Arslan iori...@yahoo.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Subject: eDismax parser and the mm parameter Date: Wed, Apr 2, 2014 10:53 AM Hi SL, Instead of fuzzy queries, can't you use the spell checker? Generally the Spell Checker (a.k.a. "did you mean") is the preferred tool for typos. Ahmet On Wednesday, April 2, 2014 4:13 PM, simpleliving...@gmail.com simpleliving...@gmail.com wrote: It only works for a single-word search term and not multiple-word search terms.
Sent from my HTC - Reply message - From: William Bell billnb...@gmail.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Subject: eDismax parser and the mm parameter Date: Wed, Apr 2, 2014 12:03 AM Fuzzy is provided; use ~ On Mon, Mar 31, 2014 at 11:04 PM, S.L simpleliving...@gmail.com wrote: Jack, Thanks a lot. I am now using pf, pf2 and pf3 and have gotten rid of the mm parameter from my queries; however, for the fuzzy phrase queries I am not sure how I would be able to leverage the Complex Phrase Query Parser; there is absolutely nothing out there that gives me any idea as to how to do that. Why is fuzzy phrase search not provided by Solr OOB? I am surprised. Thanks. On Mon, Mar 31, 2014 at 5:39 AM, Jack Krupansky j...@basetechnology.com wrote: The pf, pf2, and pf3 parameters should cover cases 1 and 2. Use q.op=OR (the default) and ignore the mm parameter. Give pf the highest boost, and boost pf3 higher than pf2. You could try using the complex phrase query parser for the third case. -- Jack Krupansky -Original Message- From: S.L Sent: Monday, March 31, 2014 12:08 AM To: solr-user@lucene.apache.org Subject: Re: eDismax parser and the mm parameter Thanks Jack, my use cases are as follows. 1. Search for Ginseng: everything related to ginseng should show up. 2. Search for White Siberian Ginseng: results with the whole phrase show up first, followed by matches on 2 words from the phrase, followed by a single word from the phrase. 3. Fuzzy search Whte Sberia Ginsng (please note the typos here): documents with White Siberian Ginseng should show up; this looks like the most complicated of all, as Solr does not support fuzzy phrase searches. (I have no solution for this yet.) Thanks again! On Sun, Mar 30, 2014 at 11:21 PM, Jack Krupansky j...@basetechnology.com wrote: The mm parameter is really only relevant when the default operator is OR or explicit OR operators are used. Again: Please provide your use case examples and your expectations for each use case.
It really doesn't make a lot of sense to prematurely focus on a solution when you haven't clearly defined your use cases. -- Jack Krupansky -Original Message- From: S.L Sent: Sunday, March 30, 2014 9:13 PM To: solr-user@lucene.apache.org Subject: Re: eDismax parser and the mm parameter Jack, I misstated the problem; I am not using the OR operator as the default now (now that I think about it, it does not make sense to use the default operator OR along with the mm parameter). The reason I want to use pf and mm in conjunction is because of my understanding of the edismax parser, and I have not looked into the pf2 and pf3 parameters yet. I will state my understanding below. pf - is used to boost the result score if the complete phrase matches. mm (less than the search term length) would help limit the query results to a certain number of better matches. With that being said, would it make sense to have a dynamic mm (set to the length of the search term - 1)? I also have a question about using fuzzy search along with the eDismax parser, but I will ask that in a separate post once I go through that aspect of the eDismax parser. Thanks again! On Sun, Mar 30, 2014 at 6
Re: eDismax parser and the mm parameter
Thanks Ahmet, I would definitely look into this . I appreciate that. On Wed, Apr 2, 2014 at 7:47 PM, Ahmet Arslan iori...@yahoo.com wrote: Yes, it has spellcheck.collate parameter. I mean it has lots of parameters and with correct combination of parameters it can suggest White Siberian Ginseng from Whte Sberia Ginsng https://cwiki.apache.org/confluence/display/solr/Spell+Checking On Thursday, April 3, 2014 1:57 AM, simpleliving...@gmail.com simpleliving...@gmail.com wrote: Ahmet. Thanks I will look into this option . Does spellchecker support multiple word search terms? Sent from my HTC - Reply message - From: Ahmet Arslan iori...@yahoo.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Subject: eDismax parser and the mm parameter Date: Wed, Apr 2, 2014 10:53 AM Hi SL, Instead of fuzzy queries, can't you use spell checker? Generally Spell Checker (a.k.a did you mean) is a preferred tool for typos. Ahmet On Wednesday, April 2, 2014 4:13 PM, simpleliving...@gmail.com simpleliving...@gmail.com wrote: It only works for a single word search term and not multiple word search term. Sent from my HTC - Reply message - From: William Bell billnb...@gmail.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Subject: eDismax parser and the mm parameter Date: Wed, Apr 2, 2014 12:03 AM Fuzzy is provided use ~ On Mon, Mar 31, 2014 at 11:04 PM, S.L simpleliving...@gmail.com wrote: Jack , Thanks a lot , I am now using the pf ,pf2 an pf3 and have gotten rid of the mm parameter from my queries, however for the fuzzy phrase queries , I am not sure how I would be able to leverage the Complex Query Parser there is absolutely nothing out there that gives me any idea as to how to do that . Why is fuzzy phrase search not provided by Solr OOB ? I am surprised Thanks. On Mon, Mar 31, 2014 at 5:39 AM, Jack Krupansky j...@basetechnology.com wrote: The pf, pf2, and pf3 parameters should cover cases 1 and 2. Use q.op=OR (the default) and ignore the mm parameter. 
Give pf the highest boost, and boost pf3 higher than pf2. You could try using the complex phrase query parser for the third case. -- Jack Krupansky -Original Message- From: S.L Sent: Monday, March 31, 2014 12:08 AM To: solr-user@lucene.apache.org Subject: Re: eDismax parser and the mm parameter Thanks Jack , my use cases are as follows. 1. Search for Ginseng everything related to ginseng should show up. 2. Search For White Siberian Ginseng results with the whole phrase show up first followed by 2 words from the phrase followed by a single word in the phrase 3. Fuzzy Search Whte Sberia Ginsng (please note the typos here) documents with White Siberian Ginseng Should show up , this looks like the most complicated of all as Solr does not support fuzzy phrase searches . (I have no solution for this yet). Thanks again! On Sun, Mar 30, 2014 at 11:21 PM, Jack Krupansky j...@basetechnology.com wrote: The mm parameter is really only relevant when the default operator is OR or explicit OR operators are used. Again: Please provide your use case examples and your expectations for each use case. It really doesn't make a lot of sense to prematurely focus on a solution when you haven't clearly defined your use cases. -- Jack Krupansky -Original Message- From: S.L Sent: Sunday, March 30, 2014 9:13 PM To: solr-user@lucene.apache.org Subject: Re: eDismax parser and the mm parameter Jack, I mis-stated the problem , I am not using the OR operator as default now(now that I think about it it does not make sense to use the default operator OR along with the mm parameter) , the reason I want to use pf and mm in conjunction is because of my understanding of the edismax parser and I have not looked into pf2 and pf3 parameters yet. I will state my understanding here below. Pf - Is used to boost the result score if the complete phrase matches. mm (less than) search term length would help limit the query results to a certain number of better matches. 
With that being said would it make sense to have dynamic mm (set to the length of search term - 1)? I also have a question around using a fuzzy search along with eDismax parser , but I will ask that in a seperate post once I go thru that aspect of eDismax parser. Thanks again ! On Sun, Mar 30, 2014 at 6:44 PM, Jack Krupansky j...@basetechnology.com wrote: If you use pf, pf2, and pf3 and boost appropriately, the effects of mm will be dwarfed. The general goal is to assure that the top documents really are the best, not to necessarily limit the total document count. Focusing on the latter could be a real waste
Re: eDismax parser and the mm parameter
Jack, Thanks a lot. I am now using pf, pf2 and pf3 and have gotten rid of the mm parameter from my queries; however, for the fuzzy phrase queries I am not sure how I would be able to leverage the Complex Phrase Query Parser; there is absolutely nothing out there that gives me any idea as to how to do that. Why is fuzzy phrase search not provided by Solr OOB? I am surprised. Thanks. On Mon, Mar 31, 2014 at 5:39 AM, Jack Krupansky j...@basetechnology.com wrote: The pf, pf2, and pf3 parameters should cover cases 1 and 2. Use q.op=OR (the default) and ignore the mm parameter. Give pf the highest boost, and boost pf3 higher than pf2. You could try using the complex phrase query parser for the third case. -- Jack Krupansky -Original Message- From: S.L Sent: Monday, March 31, 2014 12:08 AM To: solr-user@lucene.apache.org Subject: Re: eDismax parser and the mm parameter Thanks Jack, my use cases are as follows. 1. Search for Ginseng: everything related to ginseng should show up. 2. Search for White Siberian Ginseng: results with the whole phrase show up first, followed by matches on 2 words from the phrase, followed by a single word from the phrase. 3. Fuzzy search Whte Sberia Ginsng (please note the typos here): documents with White Siberian Ginseng should show up; this looks like the most complicated of all, as Solr does not support fuzzy phrase searches. (I have no solution for this yet.) Thanks again! On Sun, Mar 30, 2014 at 11:21 PM, Jack Krupansky j...@basetechnology.com wrote: The mm parameter is really only relevant when the default operator is OR or explicit OR operators are used. Again: Please provide your use case examples and your expectations for each use case. It really doesn't make a lot of sense to prematurely focus on a solution when you haven't clearly defined your use cases.
-- Jack Krupansky -Original Message- From: S.L Sent: Sunday, March 30, 2014 9:13 PM To: solr-user@lucene.apache.org Subject: Re: eDismax parser and the mm parameter Jack, I mis-stated the problem , I am not using the OR operator as default now(now that I think about it it does not make sense to use the default operator OR along with the mm parameter) , the reason I want to use pf and mm in conjunction is because of my understanding of the edismax parser and I have not looked into pf2 and pf3 parameters yet. I will state my understanding here below. Pf - Is used to boost the result score if the complete phrase matches. mm (less than) search term length would help limit the query results to a certain number of better matches. With that being said would it make sense to have dynamic mm (set to the length of search term - 1)? I also have a question around using a fuzzy search along with eDismax parser , but I will ask that in a seperate post once I go thru that aspect of eDismax parser. Thanks again ! On Sun, Mar 30, 2014 at 6:44 PM, Jack Krupansky j...@basetechnology.com wrote: If you use pf, pf2, and pf3 and boost appropriately, the effects of mm will be dwarfed. The general goal is to assure that the top documents really are the best, not to necessarily limit the total document count. Focusing on the latter could be a real waste of time. It's still not clear why or how you need or want to use OR as the default operator - you still haven't given us a use case for that. To repeat: Give us a full set of use cases before taking this XY Problem approach of pursuing a solution before the problem is understood. 
-- Jack Krupansky -Original Message- From: S.L Sent: Sunday, March 30, 2014 6:14 PM To: solr-user@lucene.apache.org Subject: Re: eDismax parser and the mm parameter Jacks Thanks Again, I am searching Chinese medicine documents , as the example I gave earlier a user can search for Ginseng or Siberian Ginseng or Red Siberian Ginseng , I certainly want to use pf parameter (which is not driven by mm parameter) , however for giving higher score to documents that have more of the terms I want to use edismax now if I give a mm of 3 and the search term is of only length 1 (like Ginseng) what does edisMax do ? On Sun, Mar 30, 2014 at 1:21 PM, Jack Krupansky j...@basetechnology.com wrote: It still depends on your objective - which you haven't told us yet. Show us some use cases and detail what your expectations are for each use case. The edismax phrase boosting is probably a lot more useful than messing around with mm. Take a look at pf, pf2, and pf3. See: http://wiki.apache.org/solr/ExtendedDisMax https://cwiki.apache.org/confluence/display/solr/The+ Extended+DisMax+Query+Parser The focus on mm may indeed be a classic XY Problem - a premature focus on a solution without detailing the problem. -- Jack Krupansky -Original Message- From: S.L Sent: Sunday, March 30, 2014 11:18 AM To: solr-user@lucene.apache.org Subject: Re: eDismax parser and the mm
eDismax parser and the mm parameter
Hi All, I am planning to use the eDismax query parser in SOLR to boost documents that have a phrase present in their fields. Now there is an mm parameter in the edismax parser query; since the query typed by the user could be of any length (i.e. >= 1 term), I would like to set the mm value to 1. I have the following questions regarding this parameter. 1. Is it set to 1 by default? 2. In my schema.xml the defaultOperator is set to AND; should I set it to OR in order for the edismax parser to be effective with an mm of 1? Thanks in advance!
Re: eDismax parser and the mm parameter
Thanks Jack! I understand the intent of the mm parameter; my question is that since the query terms being provided are not of fixed length, I do not know what the mm should look like. For example, Ginseng and Siberian Ginseng are my search terms: the first one can have an mm of up to 1 and the second one can have an mm of up to 2. Should I dynamically set the mm based on the number of search terms in my query? Thanks again. On Sun, Mar 30, 2014 at 5:20 AM, Jack Krupansky j...@basetechnology.com wrote: 1. Yes, the default for mm is 1. 2. It depends on what you are really trying to do - you haven't told us. Generally, mm=1 is equivalent to q.op=OR, and mm=100% is equivalent to q.op=AND. Generally, use q.op unless you really know what you are doing. Generally, the intent of mm is to set the minimum number of OR/SHOULD clauses that must match on the top level of a query. -- Jack Krupansky -Original Message- From: S.L Sent: Sunday, March 30, 2014 2:25 AM To: solr-user@lucene.apache.org Subject: eDismax parser and the mm parameter Hi All, I am planning to use the eDismax query parser in SOLR to boost documents that have a phrase present in their fields. Now there is an mm parameter in the edismax parser query; since the query typed by the user could be of any length (i.e. >= 1 term), I would like to set the mm value to 1. I have the following questions regarding this parameter. 1. Is it set to 1 by default? 2. In my schema.xml the defaultOperator is set to AND; should I set it to OR in order for the edismax parser to be effective with an mm of 1? Thanks in advance!
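Worth noting for the "dynamic mm" idea raised here: the mm parameter itself supports conditional specifications keyed on the query's clause count, so no client-side computation is needed. A sketch, assuming edismax request defaults in solrconfig.xml (handler context illustrative):

```xml
<!-- Sketch: a conditional mm spec. "2<-1" reads as: for queries of up
     to 2 optional clauses, require all of them; for 3 or more clauses,
     require all but one (i.e. "clause count minus 1", the behavior
     asked about in this thread). Note "<" must be escaped in XML. -->
<lst name="defaults">
  <str name="defType">edismax</str>
  <str name="mm">2&lt;-1</str>
</lst>
```

More conditions can be chained (e.g. "2&lt;-1 5&lt;75%") to loosen matching further as queries get longer.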
Re: eDismax parser and the mm parameter
Jack, Thanks again. I am searching Chinese medicine documents; as in the example I gave earlier, a user can search for Ginseng or Siberian Ginseng or Red Siberian Ginseng. I certainly want to use the pf parameter (which is not driven by the mm parameter); however, for giving a higher score to documents that match more of the terms I want to use edismax. Now, if I give an mm of 3 and the search term is of only length 1 (like Ginseng), what does edismax do? On Sun, Mar 30, 2014 at 1:21 PM, Jack Krupansky j...@basetechnology.com wrote: It still depends on your objective - which you haven't told us yet. Show us some use cases and detail what your expectations are for each use case. The edismax phrase boosting is probably a lot more useful than messing around with mm. Take a look at pf, pf2, and pf3. See: http://wiki.apache.org/solr/ExtendedDisMax https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser The focus on mm may indeed be a classic XY Problem - a premature focus on a solution without detailing the problem. -- Jack Krupansky -Original Message- From: S.L Sent: Sunday, March 30, 2014 11:18 AM To: solr-user@lucene.apache.org Subject: Re: eDismax parser and the mm parameter Thanks Jack! I understand the intent of the mm parameter; my question is that since the query terms being provided are not of fixed length, I do not know what the mm should look like. For example, Ginseng and Siberian Ginseng are my search terms: the first one can have an mm of up to 1 and the second one can have an mm of up to 2. Should I dynamically set the mm based on the number of search terms in my query? Thanks again. On Sun, Mar 30, 2014 at 5:20 AM, Jack Krupansky j...@basetechnology.com wrote: 1. Yes, the default for mm is 1. 2. It depends on what you are really trying to do - you haven't told us. Generally, mm=1 is equivalent to q.op=OR, and mm=100% is equivalent to q.op=AND. Generally, use q.op unless you really know what you are doing.
Generally, the intent of mm is to set the minimum number of OR/SHOULD clauses that must match on the top level of a query. -- Jack Krupansky -Original Message- From: S.L Sent: Sunday, March 30, 2014 2:25 AM To: solr-user@lucene.apache.org Subject: eDismax parser and the mm parameter Hi All, I am planning to use the eDismax query parser in SOLR to give boost to documents that have a phrase in their fields present. Now there is a mm parameter in the edismax parser query , since the query typed by the user could be of any length (i.e. =1) I would like to set the mm value to 1 . I have the following questions regarding this parameter. 1. Is it set to 1 by default ? 2. In my schema.xml the defaultOperator is set to AND should I set it to OR inorder for the edismax parser to be effective with a mm of 1? Thanks in advance!
Re: eDismax parser and the mm parameter
Jack, I misstated the problem; I am not using the OR operator as the default now (now that I think about it, it does not make sense to use the default operator OR along with the mm parameter). The reason I want to use pf and mm in conjunction is because of my understanding of the edismax parser, and I have not looked into the pf2 and pf3 parameters yet. I will state my understanding below. pf - is used to boost the result score if the complete phrase matches. mm (less than the search term length) would help limit the query results to a certain number of better matches. With that being said, would it make sense to have a dynamic mm (set to the length of the search term - 1)? I also have a question about using fuzzy search along with the eDismax parser, but I will ask that in a separate post once I go through that aspect of the eDismax parser. Thanks again! On Sun, Mar 30, 2014 at 6:44 PM, Jack Krupansky j...@basetechnology.com wrote: If you use pf, pf2, and pf3 and boost appropriately, the effects of mm will be dwarfed. The general goal is to assure that the top documents really are the best, not to necessarily limit the total document count. Focusing on the latter could be a real waste of time. It's still not clear why or how you need or want to use OR as the default operator - you still haven't given us a use case for that. To repeat: Give us a full set of use cases before taking this XY Problem approach of pursuing a solution before the problem is understood.
-- Jack Krupansky -Original Message- From: S.L Sent: Sunday, March 30, 2014 6:14 PM To: solr-user@lucene.apache.org Subject: Re: eDismax parser and the mm parameter Jacks Thanks Again, I am searching Chinese medicine documents , as the example I gave earlier a user can search for Ginseng or Siberian Ginseng or Red Siberian Ginseng , I certainly want to use pf parameter (which is not driven by mm parameter) , however for giving higher score to documents that have more of the terms I want to use edismax now if I give a mm of 3 and the search term is of only length 1 (like Ginseng) what does edisMax do ? On Sun, Mar 30, 2014 at 1:21 PM, Jack Krupansky j...@basetechnology.com wrote: It still depends on your objective - which you haven't told us yet. Show us some use cases and detail what your expectations are for each use case. The edismax phrase boosting is probably a lot more useful than messing around with mm. Take a look at pf, pf2, and pf3. See: http://wiki.apache.org/solr/ExtendedDisMax https://cwiki.apache.org/confluence/display/solr/The+ Extended+DisMax+Query+Parser The focus on mm may indeed be a classic XY Problem - a premature focus on a solution without detailing the problem. -- Jack Krupansky -Original Message- From: S.L Sent: Sunday, March 30, 2014 11:18 AM To: solr-user@lucene.apache.org Subject: Re: eDismax parser and the mm parameter Thanks Jack! I understand the intent of mm parameter, my question is that since the query terms being provided are not of fixed length I do not know what the mm should like for example Ginseng,Siberian Ginseng are my search terms. The first one can have an mm upto 1 and the second one can have an mm of upto 2 . Should I dynamically set the mm based on the number of search terms in my query ? Thanks again. On Sun, Mar 30, 2014 at 5:20 AM, Jack Krupansky j...@basetechnology.com wrote: 1. Yes, the default for mm is 1. 2. It depends on what you are really trying to do - you haven't told us. 
Generally, mm=1 is equivalent to q.op=OR, and mm=100% is equivalent to q.op=AND. Generally, use q.op unless you really know what you are doing. Generally, the intent of mm is to set the minimum number of OR/SHOULD clauses that must match on the top level of a query. -- Jack Krupansky -Original Message- From: S.L Sent: Sunday, March 30, 2014 2:25 AM To: solr-user@lucene.apache.org Subject: eDismax parser and the mm parameter Hi All, I am planning to use the eDismax query parser in SOLR to give boost to documents that have a phrase in their fields present. Now there is a mm parameter in the edismax parser query , since the query typed by the user could be of any length (i.e. =1) I would like to set the mm value to 1 . I have the following questions regarding this parameter. 1. Is it set to 1 by default ? 2. In my schema.xml the defaultOperator is set to AND should I set it to OR inorder for the edismax parser to be effective with a mm of 1? Thanks in advance!
Re: eDismax parser and the mm parameter
Thanks Jack, my use cases are as follows. 1. Search for Ginseng: everything related to ginseng should show up. 2. Search for White Siberian Ginseng: results with the whole phrase show up first, followed by matches on 2 words from the phrase, followed by a single word from the phrase. 3. Fuzzy search Whte Sberia Ginsng (please note the typos here): documents with White Siberian Ginseng should show up; this looks like the most complicated of all, as Solr does not support fuzzy phrase searches. (I have no solution for this yet.) Thanks again! On Sun, Mar 30, 2014 at 11:21 PM, Jack Krupansky j...@basetechnology.com wrote: The mm parameter is really only relevant when the default operator is OR or explicit OR operators are used. Again: Please provide your use case examples and your expectations for each use case. It really doesn't make a lot of sense to prematurely focus on a solution when you haven't clearly defined your use cases. -- Jack Krupansky -Original Message- From: S.L Sent: Sunday, March 30, 2014 9:13 PM To: solr-user@lucene.apache.org Subject: Re: eDismax parser and the mm parameter Jack, I misstated the problem; I am not using the OR operator as the default now (now that I think about it, it does not make sense to use the default operator OR along with the mm parameter). The reason I want to use pf and mm in conjunction is because of my understanding of the edismax parser, and I have not looked into the pf2 and pf3 parameters yet. I will state my understanding below. pf - is used to boost the result score if the complete phrase matches. mm (less than the search term length) would help limit the query results to a certain number of better matches. With that being said, would it make sense to have a dynamic mm (set to the length of the search term - 1)? I also have a question about using fuzzy search along with the eDismax parser, but I will ask that in a separate post once I go through that aspect of the eDismax parser. Thanks again!
On Sun, Mar 30, 2014 at 6:44 PM, Jack Krupansky j...@basetechnology.com wrote: If you use pf, pf2, and pf3 and boost appropriately, the effects of mm will be dwarfed. The general goal is to assure that the top documents really are the best, not to necessarily limit the total document count. Focusing on the latter could be a real waste of time. It's still not clear why or how you need or want to use OR as the default operator - you still haven't given us a use case for that. To repeat: Give us a full set of use cases before taking this XY Problem approach of pursuing a solution before the problem is understood. -- Jack Krupansky -Original Message- From: S.L Sent: Sunday, March 30, 2014 6:14 PM To: solr-user@lucene.apache.org Subject: Re: eDismax parser and the mm parameter Jacks Thanks Again, I am searching Chinese medicine documents , as the example I gave earlier a user can search for Ginseng or Siberian Ginseng or Red Siberian Ginseng , I certainly want to use pf parameter (which is not driven by mm parameter) , however for giving higher score to documents that have more of the terms I want to use edismax now if I give a mm of 3 and the search term is of only length 1 (like Ginseng) what does edisMax do ? On Sun, Mar 30, 2014 at 1:21 PM, Jack Krupansky j...@basetechnology.com wrote: It still depends on your objective - which you haven't told us yet. Show us some use cases and detail what your expectations are for each use case. The edismax phrase boosting is probably a lot more useful than messing around with mm. Take a look at pf, pf2, and pf3. See: http://wiki.apache.org/solr/ExtendedDisMax https://cwiki.apache.org/confluence/display/solr/The+ Extended+DisMax+Query+Parser The focus on mm may indeed be a classic XY Problem - a premature focus on a solution without detailing the problem. -- Jack Krupansky -Original Message- From: S.L Sent: Sunday, March 30, 2014 11:18 AM To: solr-user@lucene.apache.org Subject: Re: eDismax parser and the mm parameter Thanks Jack! 
I understand the intent of the mm parameter. My question is that since the query terms being provided are not of fixed length, I do not know what mm should look like. For example, "Ginseng" and "Siberian Ginseng" are my search terms: the first can have an mm of up to 1 and the second an mm of up to 2. Should I dynamically set mm based on the number of search terms in my query? Thanks again.

On Sun, Mar 30, 2014 at 5:20 AM, Jack Krupansky j...@basetechnology.com wrote:

1. Yes, the default for mm is 1.

2. It depends on what you are really trying to do - you haven't told us. Generally, mm=1 is equivalent to q.op=OR, and mm=100% is equivalent to q.op=AND. Generally, use q.op unless you really know what you are doing. The intent of mm is to set the minimum number of OR/SHOULD clauses that must match at the top level of a query.

-- Jack
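If S.L does go the route of setting mm per query, the client-side logic is just a function of the term count. A minimal sketch of such a helper follows; the class name and the policy (require every term for one- and two-word queries, all but one term otherwise) are illustrative assumptions, not anything prescribed in this thread or by Solr itself.

```java
// Hypothetical helper: derive an mm value from the number of
// whitespace-separated terms in a user query. The chosen policy is an
// assumption for illustration, not a Solr default.
public class MmHelper {

    public static String mmForQuery(String query) {
        String trimmed = query.trim();
        int terms = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        // Short queries: all terms must match. Longer queries: allow one miss.
        return terms <= 2 ? "100%" : String.valueOf(terms - 1);
    }

    public static void main(String[] args) {
        System.out.println(mmForQuery("Ginseng"));              // 100%
        System.out.println(mmForQuery("Siberian Ginseng"));     // 100%
        System.out.println(mmForQuery("Red Siberian Ginseng")); // 2
    }
}
```

The returned value would then be passed as the `mm` request parameter alongside `defType=edismax`; because Solr clamps mm to the actual clause count, an mm larger than the number of terms simply behaves as "all terms required", which answers the question above about mm=3 on a one-term query.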
SolrJ 503 Error
Hi All, I am running a single Solr instance (version 4.4) with Apache Tomcat 7.0.42. I am also running a Nutch instance with about 20 threads, each of which commits a document to the Solr index using the SolrJ API (version 4.3.1). Can anyone please let me know whether this error is occurring because I am committing documents too fast for a single server instance, or because of some other underlying issue? Thanks.

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://localhost:8081/solr returned non ok status:503, message:Service Unavailable
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372) ~[DynaOCrawlerUtils.jar:?]
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180) ~[DynaOCrawlerUtils.jar:?]
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) ~[solr-solrj-3.4.0.jar:3.4.0 1167142 - mike - 2011-09-09 09:06:50]
    at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:86) ~[solr-solrj-3.4.0.jar:3.4.0 1167142 - mike - 2011-09-09 09:06:50]
    at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:75) ~[solr-solrj-3.4.0.jar:3.4.0 1167142 - mike - 2011-09-09 09:06:50]
    at com.xyz.DynaOCrawlerUtils.SolrDynaOUtils.createSolrInputDocumentAndPopulateSolrIndex(SolrDynaOUtils.java:101) ~[DynaOCrawlerUtils.jar:?]
    at com.xyz.DynaOCrawlerUtils.SolrCallbackForNXParser.populateModelToSolrIndex(SolrCallbackForNXParser.java:216) [DynaOCrawlerUtils.jar:?]
    at com.xyz.DynaOCrawlerUtils.SolrCallbackForNXParser.endDocument(SolrCallbackForNXParser.java:87) [DynaOCrawlerUtils.jar:?]
    at com.xyz.DynaOCrawlerUtils.SolrDynaOUtils.populateSolrIndexFromCurrentURL(SolrDynaOUtils.java:250) [DynaOCrawlerUtils.jar:?]
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:716) [job.jar:?]
Re: SolrJ 503 Error
I have an 8GB machine, and I commit for each and every document that is added to Solr. Not sure if I am missing anything here, but from your response it seems I could use autocommit. In that case, do I no longer need to call commit? Can you please point me to a resource that explains this? Thanks.

On Sat, Dec 21, 2013 at 2:48 PM, Andrea Gazzarini agazzar...@apache.org wrote:

Not sure if we have the same scenario, but I got the same error code when I was trying to do a lot of requests (updates and queries), with 10 seconds of (hard) autocommit, against a Solr instance running in a servlet engine (Tomcat) with few resources (if I remember correctly, no more than 1GB of RAM).

Andrea
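The autocommit approach Andrea describes is configured server-side in solrconfig.xml rather than in SolrJ. A minimal sketch follows; the interval and document thresholds are illustrative assumptions, not values recommended anywhere in this thread:

```xml
<!-- solrconfig.xml sketch: let the server commit on its own schedule,
     so the crawler threads never need to call commit() themselves.
     The 15s / 10,000-doc thresholds are placeholder values. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>15000</maxTime>          <!-- hard-commit at most every 15 seconds -->
    <maxDocs>10000</maxDocs>          <!-- or once 10,000 docs are pending -->
    <openSearcher>false</openSearcher> <!-- make docs durable without reopening searchers -->
  </autoCommit>
</updateHandler>
```

With this in place, the SolrJ client would only call add() for each document and drop its per-document commit() calls entirely, which removes the commit storm that the 20 Nutch threads are currently generating.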