How to handle special characters in fuzzy search query
So my Solr query is implemented in two parts: the first query does an exact search, and if there are no results found for the exact search, it goes to a second query that does a fuzzy search. Everything works fine, but consider a situation like this: a user enters "burg +". The exact search returns no records, so the second query is called to do a fuzzy search. Now comes the problem: my fuzzy query does not understand special characters like +, - and *, which throw an error. If I don't pass special characters it works fine. But in the real world a user can include such characters in their search, which will throw an error. Now I am stuck on this and don't know how to resolve the issue. This is how my exact search query looks:

$query1=(business_name:$data*^100 OR city_name:$data*^1 OR locality_name:$data*^6 OR business_search_tag_name:$data*^8 OR type_name:$data*^7) AND (business_active_flag:1) AND (business_visible_flag:1) AND (delete_status_businessmasters:0);

This is how my fuzzy query looks:

$query2='(_query_:%20{!complexphrase%20qf=business_name^100+type_name^0.4+locality_name^6%27}%20'.$url_new.') AND (business_active_flag:1) AND (business_point:[1.5 TO 2.0])&q.op=AND&wt=json&indent=true';

I am new to Solr and don't know how to tackle this situation. Details: Solrphpclient, PHP, Solr 4.9 -- Regards Madhav Bahuguna
How to get the docs id after commit
Hi, Solr developers. I want to get the newest committed docs in the postCommit event, and then notify the other server which data can be used, but I cannot find any way to get the newest docs after a commit. Is there any way to do this? Thank you. Wen Li
Re: Proximity searching in percentage
Hi Alessandro, I'm using Solr 5.0.0, but it is still able to work. Actually I found this to be better than query~1 or query~2, as it can automatically detect and allow the 20% error rate that I want. For query~1 or query~2, does it mean that I'll have to manually detect how many characters I entered before I assign the suitable ~ (tilde) param, in order to achieve the 20% error rate? I'll probably need an edit distance of 0 for words with 3 or fewer characters, 1 for words with 4 to 9 characters, an edit distance of 2 for words with 10 to 14 characters, and an edit distance of 3 for words with more than 15 characters. Yes, regarding the performance, I'm checking whether the length check will affect the query time. Thanks for your info on that. Currently my index is small, so everything seems to run quite fast and the delay is unnoticeable. But I'm not so sure whether it will slow down until it is noticeable by the user once I have tens of collections with millions of records. Regards, Edwin

On 8 May 2015 at 16:53, Alessandro Benedetti benedetti.ale...@gmail.com wrote: Hi Zheng, actually that version of the fuzzy search is deprecated! Currently the fuzzy search syntax is: query~1 or query~2. The ~ (tilde) param is the number of edits we allow when generating all the expanded queries to run. Can I ask which version of Solr you are using? This article from 2011 shows the biggest change in fuzzy queries, and I guess it's still the current approach! Regarding the performance, what do you mean? Are you worried that the length check will affect the query time? The answer is yes, but the delay will be unnoticeable, as you simply check the length and apply the proper fuzzy param. Regarding the fact that a fuzzy query is slower than a normal query, that is true, but the FST approach guarantees really fast fuzzy queries. So if you do need the fuzziness, it's something you can cope with. Cheers

2015-05-08 3:12 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: Thank you for the information. I'm currently using the fuzzy search with the edit distance value set to ~0.79, and this has allowed a 20% error rate (i.e. for words with 5 characters it allows 1 mis-spelled character, and for words with 10 characters it allows 2 mis-spelled characters). However, for words with 4 characters I'll need to set the value to ~0.75 to allow 1 mis-spelled character, as accommodating a 4-character word requires a 25% error rate for 1 mis-spelled character. We probably will not accommodate 3-character words. I've gotten the information from here: http://lucene.apache.org/core/3_6_0/queryparsersyntax.html#Fuzzy%20Searches Just to check, will this affect the performance of the system? Regards, Edwin

On 7 May 2015 at 20:00, Alessandro Benedetti benedetti.ale...@gmail.com wrote: Hi! Currently Solr builds an FST to provide proper fuzzy search or spellcheck suggestions based on the string distance. The current default algorithm is the Levenshtein distance (which returns the number of edits as the distance metric). In your case you should calculate, client side, the edits you want to apply to your search. In your client code it should not be difficult to process the query and apply the proper number of edits depending on the length. Anyway, the max edits for the default Levenshtein distance is fixed at 2. Cheers

2015-05-05 10:24 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: Hi, I would like to check: how do we implement character proximity searching in terms of a percentage of the length of the word, instead of a fixed number of edits (characters)? For example, if we have a proximity of 20%, a word with 5 characters will have an edit distance of 1, and a word with 10 characters will automatically have an edit distance of 2. Will Solr be able to do that for us? Regards, Edwin

-- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Queries on SynonymFilterFactory
Hi, I would like to check: for the SynonymFilterFactory, I have the following in my synonyms.txt:

Titanium Dioxides, titanium oxide, pigment
pigment, colour, colouring material

If I set expand=false and I search for q=pigment, I will get results that match pigment, Titanium Dioxides and titanium oxide. But it will not match colour and colouring material, as all equivalent synonyms will only match the first in the list. If I set expand=true and I search for q=pigment, I'll get results that match everything in the list (i.e. Titanium Dioxides, titanium oxide, colour, colouring material). Is my understanding correct? Also, I would like to check: how come if I search q="pigment" (enclosed in quotes), I only get matches for Titanium Dioxides and not pigment? Regards, Edwin
Re: Proximity searching in percentage
Hi Zheng, actually that version of the fuzzy search is deprecated! Currently the fuzzy search syntax is: query~1 or query~2. The ~ (tilde) param is the number of edits we allow when generating all the expanded queries to run. Can I ask which version of Solr you are using? This article from 2011 shows the biggest change in fuzzy queries, and I guess it's still the current approach! Regarding the performance, what do you mean? Are you worried that the length check will affect the query time? The answer is yes, but the delay will be unnoticeable, as you simply check the length and apply the proper fuzzy param. Regarding the fact that a fuzzy query is slower than a normal query, that is true, but the FST approach guarantees really fast fuzzy queries. So if you do need the fuzziness, it's something you can cope with. Cheers

2015-05-08 3:12 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: [...]

On 7 May 2015 at 20:00, Alessandro Benedetti benedetti.ale...@gmail.com wrote: [...]

-- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Queries on SynonymFilterFactory
Just an update: the tokenizer class which I'm using is StandardTokenizerFactory, and I'm using Solr 5.0. On 8 May 2015 16:24, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: [...]
Re: Queries on SynonymFilterFactory
Let's explain this a little bit better. First of all, the SynonymFilter is a Token Filter, and being a Token Filter it can be part of an analysis pipeline at indexing and query time. As the different types of analysis explicitly state when the filtering happens, let's go into the details of synonyms.txt. This file contains a set of lines, each of them describing a synonym policy. There are 2 different syntaxes accepted:

couch,sofa,divan
teh => the
huge,ginormous,humungous => large
small => tiny,teeny,weeny

- A comma-separated list of words. If the token matches any of the words, then all the words in the list are substituted, which will include the original token.
- Two comma-separated lists of words with the symbol => between them. If the token matches any word on the left, then the list on the right is substituted. The original token will not be included unless it is also in the list on the right.

Regarding the expand param, directly from the official Solr documentation: expand: (optional; default: true) If true, a synonym will be expanded to all equivalent synonyms. If false, all equivalent synonyms will be reduced to the first in the list.

So, starting from this definition, let's answer your questions: 1) Regarding expand, the definition seems quite clear; if anything strange is occurring to you, let me know. 2) Regarding your second question, it depends on your synonyms.txt file: if you are not using the => syntax, you are always going to retrieve all the synonyms (including the original term). If you need more info let me know; it can also strictly depend on how you are using the filter (indexing? querying? both?). Example: if you are using the filter only at indexing time, then using the => syntax will prevent the user from searching for the original token in the synonyms.txt relation, because it will not appear in the index. Cheers

2015-05-08 9:24 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: [...]

-- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Proximity searching in percentage
2015-05-08 10:14 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: Hi Alessandro, I'm using Solr 5.0.0, but it is still able to work. Actually I found this to be better than query~1 or query~2, as it can automatically detect and allow the 20% error rate that I want.

I don't think the double param is supported anymore, so we should take a look at the tricky formula underneath to understand how the exact edits are calculated.

For query~1 or query~2, does it mean that I'll have to manually detect how many characters I entered before I assign the suitable ~ (tilde) param, in order to achieve the 20% error rate?

Yes

I'll probably need an edit distance of 0 for words with 3 or fewer characters, 1 for words with 4 to 9 characters, an edit distance of 2 for words with 10 to 14 characters, and an edit distance of 3 for words with more than 15 characters.

This would be quite easy: just check the length and assign the proper edits according to your requirements.

Yes, regarding the performance, I'm checking whether the length check will affect the query time. Thanks for your info on that. Currently my index is small, so everything seems to run quite fast and the delay is unnoticeable. But I'm not so sure whether it will slow down until it is noticeable by the user once I have tens of collections with millions of records.

I think the length check will be constant time for any string (if you are using Java, and most likely constant in all other languages as well), so I would say it won't be a problem in comparison with the actual query time.

Regards, Edwin

On 8 May 2015 at 16:53, Alessandro Benedetti benedetti.ale...@gmail.com wrote: [...]

-- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
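To make the length check discussed in this thread concrete, here is a minimal client-side sketch in Java. The thresholds follow Edwin's 20% requirement and are assumptions, and since Lucene's default Levenshtein support caps max edits at 2, the 3-edit tier is clamped:

    // Map a term's length to the fuzzy edit distance appended as ~N.
    // Thresholds are illustrative, based on the 20% error rate discussed above;
    // FuzzyQuery supports at most 2 edits, so anything larger is clamped to 2.
    public final class FuzzyParams {

        public static int editsForLength(int len) {
            if (len <= 3) return 0;   // too short to allow a mis-spelling
            if (len <= 9) return 1;   // 4-9 characters -> 1 edit (roughly 20%)
            return 2;                 // 10+ characters -> capped at Lucene's max of 2
        }

        public static String fuzzify(String term) {
            int edits = editsForLength(term.length());
            return edits == 0 ? term : term + "~" + edits;
        }

        public static void main(String[] args) {
            System.out.println(fuzzify("cat"));          // cat
            System.out.println(fuzzify("pigment"));      // pigment~1
            System.out.println(fuzzify("colouration"));  // colouration~2
        }
    }

The method runs in constant time per term, which matches the point above that the length check is negligible next to the query itself.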
Re: Queries on SynonymFilterFactory
Thanks for explaining the information. Currently I'm only using the comma-separated list of words, and only using the synonym filter at query time. I find that when I set expand=true, quite a number of irrelevant results come back, and this didn't happen when I set expand=false. I've yet to try the lists of words with the symbol => between them. I'm trying to solve the multi-word synonyms too, and I found that enclosing the multi-word synonym in quotes will solve the issue. But this creates a problem: the original token is not returned if I enclose a single word in quotes. Will using the lists of words with the symbol => between them cater to multi-word synonyms better than the comma-separated list of words? Regards, Edwin

On 8 May 2015 at 17:10, Alessandro Benedetti benedetti.ale...@gmail.com wrote: [...]

-- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Solr 5.1.0 Cloud and Zookeeper
Hello Shacky, I have recently performed a manual installation of a ZooKeeper ensemble (3 ZooKeepers) on the same machine. I used the upstart init script from the official .deb configuration https://svn.apache.org/repos/asf/zookeeper/trunk/src/packages/deb/init.d/zookeeper and modified it to suit my custom installation. You can see the changes I made to the original zookeeper.sh in this gist https://gist.github.com/manios/e4fbd700e0d8999f5e17 along with the modifications to other files in the conf/ directory of the ZooKeeper distribution. Best regards, Christos

2015-05-05 16:35 GMT+03:00 shacky shack...@gmail.com: Thank you very much for your answer. I installed ZooKeeper 3.4.6 on my Debian (Wheezy) system, and it's working well. The only problem I have is that I'm looking for an init script but cannot find anything. I'm also trying to adapt the script in Debian's zookeeperd package, but I have some problems. Do you know of any working init scripts for ZooKeeper on Debian?

2015-05-05 15:30 GMT+02:00 Mark Miller markrmil...@gmail.com: A bug-fix version difference probably won't matter. It's best to use the same version everyone else uses and the one our tests use, but it's very likely 3.4.5 will work without a hitch. - Mark

On Tue, May 5, 2015 at 9:09 AM shacky shack...@gmail.com wrote: Hi. I read on https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble that Solr needs to use the same ZooKeeper version it ships with (at the moment 3.4.6). Debian Jessie has ZooKeeper 3.4.5 (https://packages.debian.org/jessie/zookeeper). Are you sure that this version won't work with Solr 5.1.0? Thank you very much for your help! Bye
Re: Queries on SynonymFilterFactory
So it means that having more than 10 or 20 synonym files locally will still be faster than accessing an external service? I found out that ZooKeeper only allows the synonyms.txt file to be a maximum of 1MB, and as my potential synonym file is more than 20MB, I'll need to split it into more than 20 files. Regards, Edwin
Re: Queries on SynonymFilterFactory
The documents seem to point to using the AutoPhrasingTokenFilter, putting an underscore in the multi-term synonym, or changing to index-time synonyms. I'm also thinking of putting the synonyms into a database, or querying some thesaurus website when the user enters the search key, instead of using the SynonymFilterFactory. With this approach, once the user enters a search key, the program will retrieve the list of synonyms. Then I'll append the list to the search parameters (i.e. q). I'll use relevancy boosting to give the original term a higher boost, and the synonyms a lower boost. Is this a good solution? Regards, Edwin

On 8 May 2015 17:40, Alessandro Benedetti benedetti.ale...@gmail.com wrote: I found these very interesting articles that I think can help in better understanding the problem: http://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/ and http://opensourceconnections.com/blog/2013/10/27/why-is-multi-term-synonyms-so-hard-in-solr/ Take a look and let me know!

2015-05-08 10:26 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: [...]

On 8 May 2015 at 17:10, Alessandro Benedetti benedetti.ale...@gmail.com wrote: [...]
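As a rough illustration of the boosting approach Edwin describes above, here is a SolrJ sketch. The boost values are arbitrary, and the synonym lookup itself (database, thesaurus service, or a separate core) is assumed and not shown:

    import java.util.List;

    import org.apache.solr.client.solrj.SolrQuery;

    public class SynonymBoostedQuery {

        // Build a query that favours the user's original term over its synonyms.
        // The synonyms list comes from whatever lookup source is used; multi-word
        // synonyms are quoted as phrases, matching the quoting trick mentioned above.
        // (Real code would also escape any quotes embedded in the terms.)
        static SolrQuery build(String userTerm, List<String> synonyms) {
            StringBuilder q = new StringBuilder();
            q.append('"').append(userTerm).append("\"^10");        // original term, higher boost
            for (String syn : synonyms) {
                q.append(" OR \"").append(syn).append("\"^2");     // synonyms, lower boost
            }
            return new SolrQuery(q.toString());
        }
    }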
Re: Proximity searching in percentage
Hi Alessandro, Thank you so much for the info. Will try that out. Regards, Edwin

On 8 May 2015 17:27, Alessandro Benedetti benedetti.ale...@gmail.com wrote: [...]
Re: Queries on SynonymFilterFactory
I found these very interesting articles that I think can help in better understanding the problem: http://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/ and http://opensourceconnections.com/blog/2013/10/27/why-is-multi-term-synonyms-so-hard-in-solr/ Take a look and let me know!

2015-05-08 10:26 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: [...]

-- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Queries on SynonymFilterFactory
Accessing an external service (such as a thesaurus website) for each query can slow down your system a lot. Having the synonyms locally, with the Solr integration, is much better. Cheers

2015-05-08 11:46 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: [...]
Re: Solr Multilingual Indexing with one field- Guidance
Is it possible to know a little bit more about the nature of that multilingual field? I can see the KeywordTokenizer and then a lot of grams calculated from that token. What is that field used for?

2015-05-07 19:23 GMT+01:00 Kuntal Ganguly gangulykuntal1...@gmail.com: Our current production index size is 1.5 TB with 3 shards. Currently we have the following field type:

<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CustomNGramFilterFactory" minGramSize="3" maxGramSize="30" preserveOriginal="true"/>
  </analyzer>
</fieldType>

The above field type is working well for the US and English-language clients. Now we have some new Chinese and Japanese clients, so after googling for the best approach to a multilingual index (http://www.basistech.com/indexing-strategies-for-multilingual-search-with-solr-and-rosette/ and https://docs.lucidworks.com/display/lweug/Multilingual+Indexing+and+Search), there seem to be pros/cons associated with every approach. Then I tried R&D with a single-field approach, and here's my new field type:

<fieldType name="text_multi" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.CustomNGramFilterFactory" minGramSize="3" maxGramSize="30" preserveOriginal="true"/>
  </analyzer>
</fieldType>

I have kept the same tokenizer and only changed the filters. It is working well with all existing search use-cases for English documents as well as the new use-case for Chinese/Japanese documents. Now I have the following questions for the Solr experts/developers: 1) Is this a correct approach, or am I missing something? 2) Can you give me an example where there will be a problem with this new field type? A use-case/scenario with an example would be very helpful. 3) Is there any problem in the future with different clients coming up? Please provide some guidance.

-- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: New core on Solr Cloud
Thank you very much Erick. Bye

2015-05-06 17:06 GMT+02:00 Erick Erickson erickerick...@gmail.com: That should have put one replica on each machine; if it did, you're fine. Best, Erick

On Wed, May 6, 2015 at 3:58 AM, shacky shack...@gmail.com wrote: OK, I found out that the creation of a new core/collection on Solr 5.1 is done with the bin/solr script. So I created a new collection with this command: ./solr create_collection -c test -replicationFactor 3 Is this the correct way? Thank you very much, Bye!

2015-05-06 10:02 GMT+02:00 shacky shack...@gmail.com: Hi. This is my first experience with Solr Cloud. I installed three Solr nodes with three ZooKeeper instances and they seemed to start well. Now I have to create a new replicated core and I'm trying to find out how I can do it. I found many examples about how to create shards and cores, but I have to create one core with only one shard replicated on all three nodes (so basically I want to have the same data on all three nodes). Could you help me understand the correct way to do this, please? Thank you very much! Bye
Re: JSON Facet Analytics API in Solr 5.1
Hi Yonik, Any update on this question? Thanks in advance, Frank

On Thu, May 7, 2015 at 2:49 PM, Frank li fudon...@gmail.com wrote: Is there any book to read so I won't ask such dummy questions? Thanks.

On Thu, May 7, 2015 at 2:32 PM, Frank li fudon...@gmail.com wrote: This one does not have a problem, but how do I include sort in this facet query? Basically, I want to write a Solr query which can sort the facet count ascending. Something like http://localhost:8983/solr/demo/query?q=apple&json.facet={field=price sort='count asc'} I really appreciate your help. Frank

On Thu, May 7, 2015 at 2:24 PM, Yonik Seeley ysee...@gmail.com wrote: On Thu, May 7, 2015 at 4:47 PM, Frank li fudon...@gmail.com wrote: Hi Yonik, I am reading your blog. It is helpful. One question for you: for the following example, curl http://localhost:8983/solr/query -d 'q=*:*&rows=0&json.facet={ categories:{ type : terms, field : cat, sort : { x : desc}, facet:{ x : avg(price), y : sum(price) } } }' If I want to write it in the format of http://localhost:8983/solr/query?q=apple&json.facet={x:'avg(campaign_ult_defendant_cnt_is)'}, how do I do it? What problems do you encounter when you try that? If you try that URL with curl, be aware that curly braces {} are special globbing characters in curl. Turn them off with the -g option: curl -g "http://localhost:8983/solr/demo/query?q=apple&json.facet={x:'avg(price)'}" -Yonik
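For what it's worth, the same request can be built in SolrJ, which avoids curl's globbing entirely by passing json.facet as an ordinary request parameter. This is only a sketch; the demo collection and the cat field are borrowed from the examples above:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class JsonFacetSortAsc {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/demo");
            SolrQuery q = new SolrQuery("apple");
            q.setRows(0);
            // Terms facet over cat, with buckets sorted by ascending count.
            q.set("json.facet", "{categories:{type:terms,field:cat,sort:\"count asc\"}}");
            QueryResponse rsp = client.query(q);
            System.out.println(rsp.getResponse().get("facets"));
            client.close();
        }
    }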
Re: determine big documents in the index?
One of my fields (the phrase suggestion field) has 30'860'099 terms. Is this too much? Another field (the single word suggestion) has 2'156'218 terms.

-----Original Message----- From: Clemens Wyss DEV [mailto:clemens...@mysign.ch] Sent: Friday, 8 May 2015 15:54 To: solr-user@lucene.apache.org Subject: determine big documents in the index?

Context: Solr/Lucene 5.1. Is there a way to determine which documents occupy a lot of space in the index? As I don't store any fields that have text, it must be the terms extracted from the documents that occupy the space. So my question is: which documents occupy the most space in the inverted index? Context: I index approx. 7000 PDFs (extracted with Tika) into my index. I suspect that for some PDFs the extracted text is not really text but binary blobs. In order to verify this (and possibly omit these PDFs) I hope to get some hints from Solr/Lucene ;)
SolrCloud 4.8.0 - Snapshots directory take a lot of space
Hi All, Looking at the data directory in my SolrCloud cluster I have found a lot of old snapshot directories, like these: snapshot.20150506003702765 snapshot.20150506003702760 snapshot.20150507002849492 snapshot.20150507002849473 snapshot.20150507002849459, or even a month older. These directories take up a lot of space, 2 or 3 times the whole index. May I delete these directories? If yes, is there a best practice? -- Vincenzo D'Amore email: v.dam...@gmail.com skype: free.dev mobile: +39 349 8513251
Re: Queries on SynonymFilterFactory
This is quite a big synonym corpus! If it's not feasible to have only one big synonym file (I haven't checked, so I assume the 1MB limit is true, even if strange), I would do an experiment: 1) test query time with a classic Solr config; 2) use an ad hoc Solr core to manage synonyms (in this way we can keep it updated and use it with a custom version of the SynonymFilter that gets the synonyms directly from another Solr instance); 2b) develop a Solr plugin to provide this approach. If the synonym thesaurus is really big, I guess managing it through another Solr core (or something similar) locally will be better than managing it with an external web service. Cheers

2015-05-08 12:16 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: [...]

-- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
determine big documents in the index?
Context: Solr/Lucene 5.1. Is there a way to determine which documents occupy a lot of space in the index? As I don't store any fields that have text, it must be the terms extracted from the documents that occupy the space. So my question is: which documents occupy the most space in the inverted index? Context: I index approx. 7000 PDFs (extracted with Tika) into my index. I suspect that for some PDFs the extracted text is not really text but binary blobs. In order to verify this (and possibly omit these PDFs) I hope to get some hints from Solr/Lucene ;)
Re: How to handle special characters in fuzzy search query
Each of the characters you identified has meaning to the query parser: '+' marks a mandatory clause, '-' is a NOT operator, and '*' is a wildcard. To get through the query parser, these (and a bunch of others, see below) must be escaped. Personally, though, I'd pre-scrub the data; depending on your analysis chain, such things may be thrown away anyway. https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser - see the escaping special characters bit. Best, Erick

On Thu, May 7, 2015 at 11:28 PM, Madhav Bahuguna madhav.bahug...@gmail.com wrote: [...]
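To illustrate the escaping route Erick mentions, here is a minimal SolrJ sketch. The original poster is on PHP, which would need an equivalent escape function, and the field name is borrowed from the query earlier in the thread:

    import org.apache.solr.client.solrj.util.ClientUtils;

    public class EscapeUserInput {
        public static void main(String[] args) {
            String userInput = "burg +";  // raw text typed by the user
            // Backslash-escapes +, -, *, (, ), quotes, whitespace and other
            // query-parser metacharacters so they are treated literally.
            String safe = ClientUtils.escapeQueryChars(userInput);
            System.out.println("business_name:" + safe);  // business_name:burg\ \+
        }
    }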
Re: determine big documents in the index?
Oops, this may be a better link: http://lucidworks.com/blog/indexing-with-solrj/

On Fri, May 8, 2015 at 9:55 AM, Erick Erickson erickerick...@gmail.com wrote: bq: has 30'860'099 terms. Is this too much? Depends on how you indexed it. If you used shingles, then maybe, maybe not. If you just do normal text analysis, it's suspicious to say the least. There are about 300K words in the English language and you have 100X that. So either 1) you have a lot of legitimately unique terms, say part numbers, SKUs, etc., digits analyzed as text, whatever, or 2) you have a lot of garbage in your input. OCR is notorious for this, as are binary blobs. The TermsComponent is your friend; it'll allow you to get an idea of what the actual terms are, though it does take a bit of poking around. There's no good way I know of to tell which docs are taking up space in the index. What I'd probably do is use Tika in a SolrJ client and look at the data as I send it; here's a place to start: https://lucidworks.com/blog/dev/2012/02/14/indexing-with-solrj/ Best, Erick

On Fri, May 8, 2015 at 7:30 AM, Clemens Wyss DEV clemens...@mysign.ch wrote: [...]
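Along the lines of Erick's suggestion, here is a minimal Tika sketch for spotting PDFs whose extracted text is binary junk before they are indexed. The letters-and-digits heuristic and its 0.9 threshold are assumptions, not anything Solr or Tika provides:

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class PdfTextCheck {

        // Rough heuristic: real prose is overwhelmingly letters, digits and whitespace.
        static boolean looksLikeText(String s) {
            if (s.isEmpty()) return false;
            long ok = s.chars()
                       .filter(c -> Character.isLetterOrDigit(c) || Character.isWhitespace(c))
                       .count();
            return (double) ok / s.length() > 0.9;  // assumed threshold
        }

        public static void main(String[] args) throws Exception {
            try (InputStream in = new FileInputStream(args[0])) {
                BodyContentHandler handler = new BodyContentHandler(-1);  // no write limit
                new AutoDetectParser().parse(in, handler, new Metadata());
                String text = handler.toString();
                // Documents failing the check could be skipped instead of indexed.
                System.out.println(args[0] + ": " + text.length() + " chars, "
                        + (looksLikeText(text) ? "looks like text" : "looks like binary junk"));
            }
        }
    }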
Re: Slow highlighting on Solr 5.0.0
I've been looking into this again. The phrase highlighter is much slower than the default highlighter, so you might be able to add hl.usePhraseHighlighter=false to your query to make it faster. Note that the web interface will NOT help here, because that param is true by default, and the checkbox is basically broken in that respect. Also, the default highlighter doesn't seem to work in all the cases the phrase highlighter does. The current development branch of 5x is much better than 5.1, but not as good as 4.10. This ticket seems to be hitting on some of the issues at hand: https://issues.apache.org/jira/browse/SOLR-5855 I think this means they are getting there, but the performance is really still much worse than 4.10, and it's not obvious why.

On 5/5/15, 2:06 AM, Ere Maijala ere.maij...@helsinki.fi wrote: I'm seeing the same with Solr 5.1.0 after upgrading from 4.10.2. Here are my timings: 4.10.2: process: 1432.0, highlight: 723.0; 5.1.0: process: 9570.0, highlight: 8790.0. schema.xml and solrconfig.xml are available at https://github.com/NatLibFi/NDL-VuFind-Solr/tree/master/vufind/biblio/conf A couple of jstack outputs taken while the query was executing are available at http://pastebin.com/eJrEy2Wb Any suggestions would be appreciated. Or would it make sense to just file a JIRA issue? --Ere

On 3.3.2015, 0.48, Matt Hilt wrote: Short form: while testing Solr 5.0.0 within our staging environment, I noticed that highlight-enabled queries are much slower than I saw with 4.10. Are there any obvious reasons why this might be the case? As far as I can tell, nothing has changed with the default highlight search component or its parameters. A little more detail: the bulk of the collection config set was taken from the basic 4.x example config set. I changed my schema.xml and solrconfig.xml just enough to get 5.0 to create a new collection (removed non-trie fields, some other deprecated response handler definitions, etc.). I can provide my version of the solr.HighlightComponent config, but it is identical to the sample_techproducts_configs example in 5.0. Are there any other config files I could provide that might be useful? Numbers on "much slower": I indexed a very small subset of my data into the new collection and used the /select interface to do a simple debug query. Solr 4.10 gives the following pertinent info: response: { numFound: 72628, ... debug: { timing: { time: 95, process: { time: 94, query: { time: 6 }, highlight: { time: 84 }, debug: { time: 4 } } whereas Solr 5.0 is: response: { numFound: 1093, ... debug: { timing: { time: 6551, process: { time: 6549, query: { time: 0 }, highlight: { time: 6524 }, debug: { time: 25 }
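For reference, the workaround from the first paragraph expressed as a SolrJ sketch; the highlighted field name is illustrative:

    import org.apache.solr.client.solrj.SolrQuery;

    public class HighlightWorkaround {
        static SolrQuery build(String userQuery) {
            SolrQuery q = new SolrQuery(userQuery);
            q.setHighlight(true);
            q.set("hl.fl", "content");                  // field(s) to highlight; illustrative name
            q.set("hl.usePhraseHighlighter", "false");  // skip the slow phrase highlighter
            return q;
        }
    }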
Re: Queries on SynonymFilterFactory
Thank you for your suggestions. I can't do proper testing on that yet as I'm currently using a normal PC with 4GB of RAM, and all this probably requires more RAM than what I have. I've tried running the setup with 20 synonym files, and the system went Out of Memory before I could test anything. For your option 2), do you mean that I'll need to download a synonym database (like the one over 20MB in size which I have), and index it into an ad hoc Solr core to manage it? I can probably only try these out properly when I get a server machine with more RAM. Regards, Edwin On 8 May 2015 at 22:16, Alessandro Benedetti benedetti.ale...@gmail.com wrote: This is a quite big synonym corpus! If it's not feasible to have only 1 big synonym file (I haven't checked, so I assume the 1MB limit is true, even if strange) I would do an experiment: 1) testing query time with a classic Solr config 2) use an ad hoc Solr core to manage synonyms (in this way we can keep it updated and use it with a custom version of the synonym filter that will get the synonyms directly from another Solr instance) 2b) develop a Solr plugin to provide this approach. If the synonym thesaurus is really big, I guess managing it through another Solr core (or something similar) locally will be better than managing it with an external web service. Cheers 2015-05-08 12:16 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: So it means that having more than 10 or 20 synonym files locally will still be faster than accessing an external service? As I found out that ZooKeeper only allows the synonym.txt file to be a maximum of 1MB, and as my potential synonym file is more than 20MB, I'll need to split the file into more than 20 of them. Regards, Edwin -- -- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
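One thing worth checking before building an ad hoc synonym core: the synonyms attribute of SynonymFilterFactory is documented to accept a comma-separated list of files, so a 20MB thesaurus can be split into chunks that each stay under ZooKeeper's 1MB znode limit. A sketch, with hypothetical file names:

  <filter class="solr.SynonymFilterFactory"
          synonyms="syn-part01.txt,syn-part02.txt,syn-part03.txt"
          ignoreCase="true" expand="true"/>

Whether 20+ files of that total size fit in 4GB of RAM is a separate question, since the whole synonym map is held in memory.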
Re: How to get the docs id after commit
Not that I know of. The "newest doc id" is pretty ambiguous. If I transmit a batch of 100 docs then commit, they're all committed at once. Which one, then, is newest? And consider what happens, in SolrCloud mode, if I send updates to two separate nodes. The docs are forwarded to the leader for the shard they belong on, and arrival order is not guaranteed, so which one is newest? Your best bet I think is to include, say, a timestamp or some such that represents what you consider newest, then just do a *:* query and sort by your marker descending. The first doc returned will be what you defined as newest. Best, Erick On Fri, May 8, 2015 at 12:56 AM, liwen(李文).apabi l@founder.com.cn wrote: Hi, Solr Developers I want to get the newest committed docs in the postCommit event, then notify the other server which data can be used, but I can not find any way to get the newest docs after a commit, so is there any way to do this? Thank you. Wen Li
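A minimal sketch of the timestamp approach Erick describes (the field name is a placeholder): add an auto-populated date field to the schema,

  <field name="timestamp" type="date" indexed="true" stored="true" default="NOW"/>

then, after a commit, ask for the most recently indexed document:

  http://localhost:8983/solr/collection1/select?q=*:*&sort=timestamp+desc&rows=1

As Erick notes, "newest" here is whatever your marker says it is, not an ordering Solr guarantees.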
Best way to backup and restore an index for a cloud setup in 4.6.1?
All, With a cloud setup for a collection in 4.6.1, what is the most elegant way to back up and restore an index? We are specifically looking into the use case of doing a full reindex: building an index on one set of servers, backing up the index, and then restoring that backup on another set of servers. Is there a better way to rebuild indexes on another set of servers? We are not sharding, if that makes any difference. Thanks, g10vstmoney
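One option in 4.x is the replication handler's snapshot command, which also works against SolrCloud cores; a sketch, with host, core name and location as placeholders:

  http://localhost:8983/solr/collection1/replication?command=backup&location=/backups

This writes a snapshot.<timestamp> directory under the given location. 4.6.1 has no restore API, so restoring means stopping the target Solr, copying the snapshot contents into the core's data/index directory, and restarting. Whether that beats simply reindexing onto the second set of servers depends on index size and indexing cost.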
Re: determine big documents in the index?
bq: has 30'860'099 terms. Is this too much Depends on how you indexed it. If you used shingles, then maybe, maybe not. If you just do normal text analysis, it's suspicious to say the least. There are about 300K words in the English language and you have 100X that. So either (1) you have a lot of legitimately unique terms, say part numbers, SKUs, etc., digits analyzed as text, whatever, or (2) you have a lot of garbage in your input. OCR is notorious for this, as are binary blobs. The TermsComponent is your friend; it'll allow you to get an idea of what the actual terms are, though it does take a bit of poking around. There's no good way I know of to tell which docs are taking up space in the index. What I'd probably do is use Tika in a SolrJ client and look at the data as I sent it; here's a place to start: https://lucidworks.com/blog/dev/2012/02/14/indexing-with-solrj/ Best, Erick On Fri, May 8, 2015 at 7:30 AM, Clemens Wyss DEV clemens...@mysign.ch wrote: One of my fields (the phrase suggestion field) has 30'860'099 terms. Is this too much? Another field (the single word suggestion) has 2'156'218 terms. -----Original Message----- From: Clemens Wyss DEV [mailto:clemens...@mysign.ch] Sent: Friday, 8 May 2015 15:54 To: solr-user@lucene.apache.org Subject: determine big documents in the index? Context: Solr/Lucene 5.1. Is there a way to determine which documents occupy a lot of space in the index? As I don't store any fields that have text, it must be the terms extracted from the documents that occupy the space. So my question is: which documents occupy the most space in the inverted index? Context: I index approx. 7000 PDFs (extracted with Tika) into my index. I suspect that for some PDFs the extracted text is not really text but binary blobs. In order to verify this (and possibly omit these PDFs) I hope to get some hints from Solr/Lucene ;)
Re: How to handle special characters in fuzzy search query
Steven: They're listed in the ref guide I posted. Not a concise list, but you'll see || and other interesting bits. On Fri, May 8, 2015 at 9:20 AM, Steven White swhite4...@gmail.com wrote: Hi Erick, Is there a documented list of all operators (AND, OR, NOT, etc.) that also need to be escaped? Are there more besides the 3 I listed? Thanks Steve On Fri, May 8, 2015 at 11:47 AM, Erick Erickson erickerick...@gmail.com wrote: Each of the characters you identified are characters that have meaning to the query parser: '+' is a mandatory clause, '-' is a NOT operator and '*' is a wildcard. To get through the query parser, these (and a bunch of others, see below) must be escaped. Personally, though, I'd pre-scrub the data. Depending on your analysis chain such things may be thrown away anyway. https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser - the escaping special characters bit. Best, Erick On Thu, May 7, 2015 at 11:28 PM, Madhav Bahuguna madhav.bahug...@gmail.com wrote: So my Solr query is implemented in two parts: the first query does an exact search; if there are no results found for the exact search, it goes to the second query, which does a fuzzy search. Everything works fine, but in situations like a user entering "burg +", the exact search returns no records, so the second query is called to do a fuzzy search. Now comes the problem: my fuzzy query does not understand special characters like +, - and *, which throws an error. If I don't pass special characters it works fine. But in the real world a user can put such characters in their search, which will throw an error. Now I am stuck and don't know how to resolve this issue. This is how my exact search query looks: $query1=(business_name:$data*^100 OR city_name:$data*^1 OR locality_name:$data*^6 OR business_search_tag_name:$data*^8 OR type_name:$data*^7) AND (business_active_flag:1) AND (business_visible_flag:1) AND (delete_status_businessmasters:0); This is how my fuzzy query looks: $query2='(_query_:%20{!complexphrase%20qf=business_name^100+type_name^0.4+locality_name^6%27}%20'.$url_new.')AND(business_active_flag:1)AND(business_point:[1.5 TO 2.0])q.op=ANDwt=jsonindent=true'; I am new to Solr and don't know how to tackle this situation. Details: Solrphpclient, PHP, Solr 4.9 -- Regards Madhav Bahuguna
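For anyone doing this from SolrJ rather than PHP, there is a ready-made escaping helper; a minimal sketch (userInput is a placeholder for whatever the user typed):

  import org.apache.solr.client.solrj.util.ClientUtils;

  // escapeQueryChars backslash-escapes every character the query
  // parsers treat specially (including whitespace), so input like
  // "burg +" becomes "burg\ \+" and parses cleanly
  String safe = ClientUtils.escapeQueryChars(userInput);

  // it escapes ~ and * too, so append fuzzy/wildcard operators
  // only after escaping the bare term:
  String fuzzy = ClientUtils.escapeQueryChars("burg") + "~2";

The PHP client has no equivalent built in, but the same character set can be backslash-escaped with a regex before the query string is assembled.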
Re: How to handle special characters in fuzzy search query
Hi Erick, Is there a documented list of all operators (AND, OR, NOT, etc.) that also need to be escaped? Are there more besides the 3 I listed? Thanks Steve On Fri, May 8, 2015 at 11:47 AM, Erick Erickson erickerick...@gmail.com wrote: Each of the characters you identified are characters that have meaning to the query parser: '+' is a mandatory clause, '-' is a NOT operator and '*' is a wildcard. To get through the query parser, these (and a bunch of others, see below) must be escaped. Personally, though, I'd pre-scrub the data. Depending on your analysis chain such things may be thrown away anyway. https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser - the escaping special characters bit. Best, Erick On Thu, May 7, 2015 at 11:28 PM, Madhav Bahuguna madhav.bahug...@gmail.com wrote: So my Solr query is implemented in two parts: the first query does an exact search; if there are no results found for the exact search, it goes to the second query, which does a fuzzy search. Everything works fine, but in situations like a user entering "burg +", the exact search returns no records, so the second query is called to do a fuzzy search. Now comes the problem: my fuzzy query does not understand special characters like +, - and *, which throws an error. If I don't pass special characters it works fine. But in the real world a user can put such characters in their search, which will throw an error. Now I am stuck and don't know how to resolve this issue. This is how my exact search query looks: $query1=(business_name:$data*^100 OR city_name:$data*^1 OR locality_name:$data*^6 OR business_search_tag_name:$data*^8 OR type_name:$data*^7) AND (business_active_flag:1) AND (business_visible_flag:1) AND (delete_status_businessmasters:0); This is how my fuzzy query looks: $query2='(_query_:%20{!complexphrase%20qf=business_name^100+type_name^0.4+locality_name^6%27}%20'.$url_new.')AND(business_active_flag:1)AND(business_point:[1.5 TO 2.0])q.op=ANDwt=jsonindent=true'; I am new to Solr and don't know how to tackle this situation. Details: Solrphpclient, PHP, Solr 4.9 -- Regards Madhav Bahuguna
Re: Limit the documents for each shard in solr cloud
Hi, Actually we are facing a lot of issues with Solr shards in our environment. Our environment is fully loaded with around 150 million documents, where each document has around 50+ stored fields, many of them multivalued. We also have a lot of custom components in this environment which use FieldCache and various other Solr features. The main issue we are facing is shards going down frequently in SolrCloud. As you mentioned in this reply (and as I have observed in various other replies on memory issues), I will try to debug further and keep this thread posted on any issues I find in that process. Thanks, Jilani On Thu, May 7, 2015 at 10:17 PM, Daniel Collins danwcoll...@gmail.com wrote: Jilani, you did say "My team needs that option"; if at all possible, my first response would be "why?". Why do they want to limit the number of documents per shard? What's the rationale/use case behind that requirement? Once we understand that, we can explain why it's a bad idea. :) I suspect I'm reiterating Jack's comments, but why are you sharding in the first place? 8 shards split across 4 machines, so 2 shards per machine. But you have 2 replicas of each shard, so you have 16 Solr cores, and hence 4 Solr cores per machine. Since you need an instance of all 8 shards to be up in order to service requests, you can get away with everything on 2 machines, but you still have 8 Solr cores to manage in order to have a fully functioning system. What's the benefit of sharding in this scenario? Sharding adds complexity, so you normally only add sharding if your search times are too slow without it. You need to work out how much disk space the whole 20m docs is going to take (maybe index 1m or 5m docs and extrapolate if they are all equivalent in size), then split it across 4 machines. But as Erick points out you need to allow for merges to occur, so whatever the space of the static data set, you need to allow for double that from time to time if background merges are happening. On 7 May 2015 at 16:05, Jack Krupansky jack.krupan...@gmail.com wrote: A leader is also a replica - SolrCloud is not a master/slave architecture. Any replica can be elected to be the leader, but that is only temporary and can change over time. You can place multiple shards on a single node, but was that really your intention? Generally, number of nodes equals number of shards times the replication factor. But then divided by shards per node if you do place more than one shard per node. -- Jack Krupansky On Thu, May 7, 2015 at 1:29 AM, Jilani Shaik jilani24...@gmail.com wrote: Hi, Is it possible to restrict the number of documents per shard in SolrCloud? Let's say we have a SolrCloud with 4 nodes, and on each node we have one leader and one replica. Likewise, in total we have 8 shards, including replicas. Now I need to index my documents in such a way that each shard will have only 5 million documents. Total documents in the SolrCloud should be 20 million documents. Thanks, Jilani
Re: Not able to Add docValues in Solr
Never mind.. used the zkcli.sh that comes with Solr to accomplish the firewall
indexing java byte code in classes / jars
I'm looking to use Solr to search over the byte code in classes and JARs. Does anyone know of, or have experience with, Analyzers, Tokenizers, and Token Filters for such a task? Regards Mark
Re: How to handle special characters in fuzzy search query
FWIW you may also want to drop the boolean ops in favour of + and - (OR being the default). regards, LAFK 2015-05-08 18:59 GMT+02:00 Erick Erickson erickerick...@gmail.com: Steven: They're listed in the ref guide I posted. Not a concise list, but you'll see || and other interesting bits. On Fri, May 8, 2015 at 9:20 AM, Steven White swhite4...@gmail.com wrote: Hi Erick, Is there a documented list of all operators (AND, OR, NOT, etc.) that also need to be escaped? Are there more besides the 3 I listed? Thanks Steve On Fri, May 8, 2015 at 11:47 AM, Erick Erickson erickerick...@gmail.com wrote: Each of the characters you identified are characters that have meaning to the query parser: '+' is a mandatory clause, '-' is a NOT operator and '*' is a wildcard. To get through the query parser, these (and a bunch of others, see below) must be escaped. Personally, though, I'd pre-scrub the data. Depending on your analysis chain such things may be thrown away anyway. https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser - the escaping special characters bit. Best, Erick On Thu, May 7, 2015 at 11:28 PM, Madhav Bahuguna madhav.bahug...@gmail.com wrote: So my Solr query is implemented in two parts: the first query does an exact search; if there are no results found for the exact search, it goes to the second query, which does a fuzzy search. Everything works fine, but in situations like a user entering "burg +", the exact search returns no records, so the second query is called to do a fuzzy search. Now comes the problem: my fuzzy query does not understand special characters like +, - and *, which throws an error. If I don't pass special characters it works fine. But in the real world a user can put such characters in their search, which will throw an error. Now I am stuck and don't know how to resolve this issue. This is how my exact search query looks: $query1=(business_name:$data*^100 OR city_name:$data*^1 OR locality_name:$data*^6 OR business_search_tag_name:$data*^8 OR type_name:$data*^7) AND (business_active_flag:1) AND (business_visible_flag:1) AND (delete_status_businessmasters:0); This is how my fuzzy query looks: $query2='(_query_:%20{!complexphrase%20qf=business_name^100+type_name^0.4+locality_name^6%27}%20'.$url_new.')AND(business_active_flag:1)AND(business_point:[1.5 TO 2.0])q.op=ANDwt=jsonindent=true'; I am new to Solr and don't know how to tackle this situation. Details: Solrphpclient, PHP, Solr 4.9 -- Regards Madhav Bahuguna
Re: and stopword in user query is being change to q.op=AND
Thanks Shawn and Hoss. Just added lowercaseOperators=false to my edismax config and everything seems to be working. *Thanks,* *Rajesh,* *(mobile) : 8328789519.* On Mon, Apr 27, 2015 at 11:53 AM, Rajesh Hazari rajeshhaz...@gmail.com wrote: I did go through the documentation of edismax (Solr 5.1 documentation), which suggests using the *stopwords* query param to signal the parser to respect the StopFilterFactory while parsing; still, I did not find this happening. My final query looks like this http://host/solr/collection/select?q=term1+and+term2&sort=update_time+desc&rows=1&wt=json&indent=true&debugQuery=true&defType=edismax&stopwords=true&group=true&group.ngroups=true&group.field=title&stopwords=true debug:{ rawquerystring:term1 and term2, querystring:term1 and term2, parsedquery:(+(+DisjunctionMaxQuery((textSpell:term1)) +DisjunctionMaxQuery((textSpell:term2/no_coord, parsedquery_toString:+(+(textSpell:term1) +(textSpell:term2)), explain:{}, QParser:ExtendedDismaxQParser,... .. Was this param introduced in, and supported from, a specific version of Solr? Our Solr versions are 4.7 and 4.9. *Thanks,* *Rajesh**.* On Sun, Apr 26, 2015 at 9:22 PM, Rajesh Hazari rajeshhaz...@gmail.com wrote: Thank you Hoss for correcting my understanding; again I missed this concept of edismax. Do we have any SolrJ class or helper to handle the scenario of passing the query terms (with stopwords stripped) to edismax using the SolrJ API? For example, if a user queries for *term1 and term2*, build a query to pass on to edismax so that this user query will be parsed as *parsedquery: (+(DisjunctionMaxQuery((textSpell:term1) DisjunctionMaxQuery((textSpell:term2/no_coord* *Thanks,* *Rajesh**.* On Fri, Apr 24, 2015 at 1:13 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I was under the understanding that stopwords are filtered even before being : parsed by the search handler; I do have the filter in the collection schema to : filter stopwords, and the analysis shows that this stopword is filtered Generally speaking, your understanding of the order of operations for query parsing (regardless of the parser) and analysis (regardless of the fields/analyzers/filters/etc...) is backwards. The query parser gets, as its input, the query string (as a *single* string) and the request params. It inspects/parses the string according to its rules/options/syntax and, based on what it finds in that string (and in other request params), it passes some/all of that string to the analyzer for one or more fields, and uses the results of those analyzers as the terms for building up a query structure. Ask yourself: if the raw user query input was first passed to an analyzer (for stopword filtering as you suggest) before being passed to the query parser -- how would Solr know what analyzer to use? In many parsers (like lucene and edismax) the fields to use can be specified *inside* the query string itself. Likewise: how would you ensure that syntactically significant string sequences (like ( and : and AND etc.) that an analyzer might normally strip out based on the tokenizer/tokenfilters would be preserved so that the query parser could have them and use them to drive the resulting query structure? -Hoss http://www.lucidworks.com/
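For reference, a sketch of where that setting can live in solrconfig.xml (the handler name and other defaults are placeholders):

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <!-- don't treat lowercase "and"/"or" in user queries as boolean operators -->
      <str name="lowercaseOperators">false</str>
    </lst>
  </requestHandler>

The same param can also be passed per request as lowercaseOperators=false.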
Re: indexing java byte code in classes / jars
What do the various Java IDEs use for indexing classes for field/type/variable/method usage search? I imagine it's got to be bytecode. On Fri, May 8, 2015 at 2:40 PM, Tomasz Borek tomasz.bo...@gmail.com wrote: Out of curiosity: why bytecode? regards, LAFK 2015-05-08 21:31 GMT+02:00 Mark javam...@gmail.com: I'm looking to use Solr to search over the byte code in classes and JARs. Does anyone know of, or have experience with, Analyzers, Tokenizers, and Token Filters for such a task? Regards Mark
Re: indexing java byte code in classes / jars
Out of curiosity: why bytecode? regards, LAFK 2015-05-08 21:31 GMT+02:00 Mark javam...@gmail.com: I'm looking to use Solr to search over the byte code in classes and JARs. Does anyone know of, or have experience with, Analyzers, Tokenizers, and Token Filters for such a task? Regards Mark
Re: Solr Exception The remote server returned an error: (400) Bad Request.
Short answer: wget skips the body on a 400, assuming you didn't want the error page stored. Long answer: get your error page with additional wget params, like so: ✗ wget -Sd http://10.0.3.113:8080/solr/collection1/vitas\?q\=coreD%3A25 DEBUG output created by Wget 1.15 on linux-gnu. URI encoding = `UTF-8' --2015-05-08 21:56:55-- http://10.0.3.113:8080/solr/collection1/vitas?q=coreD%3A25 Łączenie się z 10.0.3.113:8080... połączono. Created socket 3. Releasing 0x00aa35d0 (new refcount 0). Deleting unused 0x00aa35d0. ---request begin--- GET /solr/collection1/vitas?q=coreD%3A25 HTTP/1.1 User-Agent: Wget/1.15 (linux-gnu) Accept: */* Host: 10.0.3.113:8080 Connection: Keep-Alive ---request end--- Żądanie HTTP wysłano, oczekiwanie na odpowiedź... ---response begin--- HTTP/1.1 400 Bad Request Server: Apache-Coyote/1.1 Cache-Control: no-cache, no-store Pragma: no-cache Expires: Sat, 01 Jan 2000 01:00:00 GMT Last-Modified: Fri, 08 May 2015 19:56:55 GMT ETag: 14d351a25a9 Content-Type: application/json;charset=UTF-8 Transfer-Encoding: chunked Date: Fri, 08 May 2015 19:56:55 GMT Connection: close ---response end--- HTTP/1.1 400 Bad Request Server: Apache-Coyote/1.1 Cache-Control: no-cache, no-store Pragma: no-cache Expires: Sat, 01 Jan 2000 01:00:00 GMT Last-Modified: Fri, 08 May 2015 19:56:55 GMT ETag: 14d351a25a9 Content-Type: application/json;charset=UTF-8 Transfer-Encoding: chunked Date: Fri, 08 May 2015 19:56:55 GMT Connection: close Registered socket 3 for persistent reuse. URI content encoding = `UTF-8' Skipping 95 bytes of body: [{responseHeader:{status:400,QTime:2},error:{msg:undefined field coreD,code:400}} ] done. 2015-05-08 21:56:55 BŁĄD 400: Bad Request. regards, LAFK 2015-05-05 17:44 GMT+02:00 marotosg marot...@gmail.com: Thanks for the answer, but I don't think that's going to solve my problem. For instance, if I copy this query into the Chrome browser http://localhost:8080/solr48/person/select?q=CoreD:25 I get this error: status=400, QTime=1, q=CoreD:25, msg="undefined field CoreD", code=400. If I use wget from Linux, wget http://localhost:8080/solr48/person/select?q=CoreD:25 I get ERROR: 400 Bad Request. Is there any reason why I am not getting the same error? Thanks
Re: indexing java byte code in classes / jars
To answer "why bytecode": mostly because the use case I have is to index as much detail as possible from jars/classes: extract class names, method names/signatures, packages/imports. I am considering using ASM in order to generate an analysis view of the class. The sort of use cases I have would be method/signature searches. For example: 1) show any classes with a method named parse* 2) show any classes with a method named parse that takes a type *json* ...etc. In the past I have written something to reverse out javadocs from just java bytecode; using Solr would make this idea considerably more powerful. Thanks for the suggestions so far On 8 May 2015 at 21:19, Erik Hatcher erik.hatc...@gmail.com wrote: Oh, and sorry, I omitted a couple of details: # creating the “java” core/collection bin/solr create -c java # I ran this from my Solr source code checkout, so that SolrLogFormatter.class just happened to be handy Erik On May 8, 2015, at 4:11 PM, Erik Hatcher erik.hatc...@gmail.com wrote: What kinds of searches do you want to run? Are you trying to extract class names, method names, and such and make those searchable? If that’s the case, you need some kind of “parser” to reverse engineer that information from .class and .jar files before feeding it to Solr, which would happen before analysis. Java itself comes with a javap command that can do this; whether this is the “best” way to go for your scenario I don’t know, but here’s an interesting example pasted below (using Solr 5.x).
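In case it helps anyone going the same route, a minimal ASM sketch of pulling class and method names out of a single .class file (assumes the asm jar on the classpath; untested against any particular ASM version):

  import org.objectweb.asm.ClassReader;
  import org.objectweb.asm.ClassVisitor;
  import org.objectweb.asm.MethodVisitor;
  import org.objectweb.asm.Opcodes;
  import java.nio.file.Files;
  import java.nio.file.Paths;

  public class ClassIndexer {
      public static void main(String[] args) throws Exception {
          // read one .class file given on the command line
          byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
          new ClassReader(bytes).accept(new ClassVisitor(Opcodes.ASM5) {
              @Override
              public void visit(int version, int access, String name,
                      String sig, String superName, String[] ifaces) {
                  // candidate "class name" field for a Solr document
                  System.out.println("class: " + name.replace('/', '.'));
              }
              @Override
              public MethodVisitor visitMethod(int access, String name,
                      String desc, String sig, String[] exceptions) {
                  // method name plus raw JVM descriptor,
                  // e.g. parse (Ljava/lang/String;)V
                  System.out.println("method: " + name + " " + desc);
                  return null; // don't descend into bytecode instructions
              }
          }, ClassReader.SKIP_CODE);
      }
  }

Each class would then become one Solr document with multivalued method-name and signature fields; the raw descriptors can be turned into readable type names with ASM's Type.getArgumentTypes(desc).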
Re: Fuzzy phrases + weighting at query level or do I need to program?
Best I found so far is: +place:(+word1~ +word2~ +word3~) regards, LAFK 2015-04-26 3:20 GMT+02:00 Tomasz Borek tomasz.bo...@gmail.com: Ave! How do I make fuzzy search work on lengthy names? As in La Riviera Montana de los Diablos or Unified Mega Corp Super Dwelling? Across all queries? My query has 3 levels of results: Best results are: +title:X +place:Y - Q1 If none such are found, +title:x - Q2 then +place:Y - Q3 All in all: (Q1) (Q2) (Q3) Yonik's examples have fuzzy search by ~ on one term; on more than one it becomes proximity search. regards, LAFK
Re: indexing java byte code in classes / jars
Oh, and sorry, I omitted a couple of details: # creating the “java” core/collection bin/solr create -c java # I ran this from my Solr source code checkout, so that SolrLogFormatter.class just happened to be handy Erik On May 8, 2015, at 4:11 PM, Erik Hatcher erik.hatc...@gmail.com wrote: What kinds of searches do you want to run? Are you trying to extract class names, method names, and such and make those searchable? If that’s the case, you need some kind of “parser” to reverse engineer that information from .class and .jar files before feeding it to Solr, which would happen before analysis. Java itself comes with a javap command that can do this; whether this is the “best” way to go for your scenario I don’t know, but here’s an interesting example pasted below (using Solr 5.x). — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class > test.txt bin/post -c java test.txt now search for coreInfoMap http://localhost:8983/solr/java/browse?q=coreInfoMap I tried to be cleverer and use the stdin option of bin/post, like this: javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class | bin/post -c java -url http://localhost:8983/solr/java/update/extract -type text/plain -params literal.id=SolrLogFormatter -out yes -d but something isn’t working right with the stdin detection like that (it does work to `cat test.txt | bin/post…` though, hmmm) test.txt looks like this, `cat test.txt`: Compiled from "SolrLogFormatter.java" public class org.apache.solr.SolrLogFormatter extends java.util.logging.Formatter { long startTime; long lastTime; java.util.Map<org.apache.solr.SolrLogFormatter$Method, java.lang.String> methodAlias; public boolean shorterFormat; java.util.Map<org.apache.solr.core.SolrCore, org.apache.solr.SolrLogFormatter$CoreInfo> coreInfoMap; public java.util.Map<java.lang.String, java.lang.String> classAliases; static java.lang.ThreadLocal<java.lang.String> threadLocal; public org.apache.solr.SolrLogFormatter(); public void setShorterFormat(); public java.lang.String format(java.util.logging.LogRecord); public void appendThread(java.lang.StringBuilder, java.util.logging.LogRecord); public java.lang.String _format(java.util.logging.LogRecord); public java.lang.String getHead(java.util.logging.Handler); public java.lang.String getTail(java.util.logging.Handler); public java.lang.String formatMessage(java.util.logging.LogRecord); public static void main(java.lang.String[]) throws java.lang.Exception; public static void go() throws java.lang.Exception; static {}; } On May 8, 2015, at 3:31 PM, Mark javam...@gmail.com wrote: I'm looking to use Solr to search over the byte code in classes and JARs. Does anyone know of, or have experience with, Analyzers, Tokenizers, and Token Filters for such a task? Regards Mark
Re: indexing java byte code in classes / jars
Erik, Thanks for the pretty much OOTB approach. I think I'm going to just try a range of approaches and see how far I get. The "IDEs do this" suggestion would be worth looking into as well. On 8 May 2015 at 22:14, Mark javam...@gmail.com wrote: https://searchcode.com/ looks really interesting, however I want to crunch as many searchable aspects as possible out of jars sitting on a classpath or under a project structure... Really early days so I'm open to any suggestions On 8 May 2015 at 22:09, Mark javam...@gmail.com wrote: To answer "why bytecode": mostly because the use case I have is to index as much detail as possible from jars/classes: extract class names, method names/signatures, packages/imports. I am considering using ASM in order to generate an analysis view of the class. The sort of use cases I have would be method/signature searches. For example: 1) show any classes with a method named parse* 2) show any classes with a method named parse that takes a type *json* ...etc. In the past I have written something to reverse out javadocs from just java bytecode; using Solr would make this idea considerably more powerful. Thanks for the suggestions so far On 8 May 2015 at 21:19, Erik Hatcher erik.hatc...@gmail.com wrote: Oh, and sorry, I omitted a couple of details: # creating the “java” core/collection bin/solr create -c java # I ran this from my Solr source code checkout, so that SolrLogFormatter.class just happened to be handy Erik On May 8, 2015, at 4:11 PM, Erik Hatcher erik.hatc...@gmail.com wrote: What kinds of searches do you want to run? Are you trying to extract class names, method names, and such and make those searchable? If that’s the case, you need some kind of “parser” to reverse engineer that information from .class and .jar files before feeding it to Solr, which would happen before analysis. Java itself comes with a javap command that can do this; whether this is the “best” way to go for your scenario I don’t know, but here’s an interesting example pasted below (using Solr 5.x).
— Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class > test.txt bin/post -c java test.txt now search for coreInfoMap http://localhost:8983/solr/java/browse?q=coreInfoMap I tried to be cleverer and use the stdin option of bin/post, like this: javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class | bin/post -c java -url http://localhost:8983/solr/java/update/extract -type text/plain -params literal.id=SolrLogFormatter -out yes -d but something isn’t working right with the stdin detection like that (it does work to `cat test.txt | bin/post…` though, hmmm) test.txt looks like this, `cat test.txt`: Compiled from "SolrLogFormatter.java" public class org.apache.solr.SolrLogFormatter extends java.util.logging.Formatter { long startTime; long lastTime; java.util.Map<org.apache.solr.SolrLogFormatter$Method, java.lang.String> methodAlias; public boolean shorterFormat; java.util.Map<org.apache.solr.core.SolrCore, org.apache.solr.SolrLogFormatter$CoreInfo> coreInfoMap; public java.util.Map<java.lang.String, java.lang.String> classAliases; static java.lang.ThreadLocal<java.lang.String> threadLocal; public org.apache.solr.SolrLogFormatter(); public void setShorterFormat(); public java.lang.String format(java.util.logging.LogRecord); public void appendThread(java.lang.StringBuilder, java.util.logging.LogRecord); public java.lang.String _format(java.util.logging.LogRecord); public java.lang.String getHead(java.util.logging.Handler); public java.lang.String getTail(java.util.logging.Handler); public java.lang.String formatMessage(java.util.logging.LogRecord); public static void main(java.lang.String[]) throws java.lang.Exception; public static void go() throws java.lang.Exception; static {}; } On May 8, 2015, at 3:31 PM, Mark javam...@gmail.com wrote: I'm looking to use Solr to search over the byte code in classes and JARs. Does anyone know of, or have experience with, Analyzers, Tokenizers, and Token Filters for such a task? Regards Mark
Re: indexing java byte code in classes / jars
What kinds of searches do you want to run? Are you trying to extract class names, method names, and such and make those searchable? If that’s the case, you need some kind of “parser” to reverse engineer that information from .class and .jar files before feeding it to Solr, which would happen before analysis. Java itself comes with a javap command that can do this; whether this is the “best” way to go for your scenario I don’t know, but here’s an interesting example pasted below (using Solr 5.x). — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class > test.txt bin/post -c java test.txt now search for coreInfoMap http://localhost:8983/solr/java/browse?q=coreInfoMap I tried to be cleverer and use the stdin option of bin/post, like this: javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class | bin/post -c java -url http://localhost:8983/solr/java/update/extract -type text/plain -params literal.id=SolrLogFormatter -out yes -d but something isn’t working right with the stdin detection like that (it does work to `cat test.txt | bin/post…` though, hmmm) test.txt looks like this, `cat test.txt`: Compiled from "SolrLogFormatter.java" public class org.apache.solr.SolrLogFormatter extends java.util.logging.Formatter { long startTime; long lastTime; java.util.Map<org.apache.solr.SolrLogFormatter$Method, java.lang.String> methodAlias; public boolean shorterFormat; java.util.Map<org.apache.solr.core.SolrCore, org.apache.solr.SolrLogFormatter$CoreInfo> coreInfoMap; public java.util.Map<java.lang.String, java.lang.String> classAliases; static java.lang.ThreadLocal<java.lang.String> threadLocal; public org.apache.solr.SolrLogFormatter(); public void setShorterFormat(); public java.lang.String format(java.util.logging.LogRecord); public void appendThread(java.lang.StringBuilder, java.util.logging.LogRecord); public java.lang.String _format(java.util.logging.LogRecord); public java.lang.String getHead(java.util.logging.Handler); public java.lang.String getTail(java.util.logging.Handler); public java.lang.String formatMessage(java.util.logging.LogRecord); public static void main(java.lang.String[]) throws java.lang.Exception; public static void go() throws java.lang.Exception; static {}; } On May 8, 2015, at 3:31 PM, Mark javam...@gmail.com wrote: I'm looking to use Solr to search over the byte code in classes and JARs. Does anyone know of, or have experience with, Analyzers, Tokenizers, and Token Filters for such a task? Regards Mark
RE: indexing java byte code in classes / jars
There are a number of reverse compilers for Java. Some are quite good and very detailed, so long as the byte code has not been deliberately obfuscated. Of course the original sources would be better for picking up comments. But then you'd need a Java parser (the compiler front end), of which there are a few available as well. Hmm, this looks interesting ... https://searchcode.com/ -----Original Message----- From: Erik Hatcher [mailto:erik.hatc...@gmail.com] Sent: Friday, May 08, 2015 4:19 PM To: solr-user@lucene.apache.org Subject: Re: indexing java byte code in classes / jars Oh, and sorry, I omitted a couple of details: # creating the “java” core/collection bin/solr create -c java # I ran this from my Solr source code checkout, so that SolrLogFormatter.class just happened to be handy Erik On May 8, 2015, at 4:11 PM, Erik Hatcher erik.hatc...@gmail.com wrote: What kinds of searches do you want to run? Are you trying to extract class names, method names, and such and make those searchable? If that’s the case, you need some kind of “parser” to reverse engineer that information from .class and .jar files before feeding it to Solr, which would happen before analysis. Java itself comes with a javap command that can do this; whether this is the “best” way to go for your scenario I don’t know, but here’s an interesting example pasted below (using Solr 5.x). — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class > test.txt bin/post -c java test.txt now search for coreInfoMap http://localhost:8983/solr/java/browse?q=coreInfoMap I tried to be cleverer and use the stdin option of bin/post, like this: javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class | bin/post -c java -url http://localhost:8983/solr/java/update/extract -type text/plain -params literal.id=SolrLogFormatter -out yes -d but something isn’t working right with the stdin detection like that (it does work to `cat test.txt | bin/post…` though, hmmm) test.txt looks like this, `cat test.txt`: Compiled from "SolrLogFormatter.java" public class org.apache.solr.SolrLogFormatter extends java.util.logging.Formatter { long startTime; long lastTime; java.util.Map<org.apache.solr.SolrLogFormatter$Method, java.lang.String> methodAlias; public boolean shorterFormat; java.util.Map<org.apache.solr.core.SolrCore, org.apache.solr.SolrLogFormatter$CoreInfo> coreInfoMap; public java.util.Map<java.lang.String, java.lang.String> classAliases; static java.lang.ThreadLocal<java.lang.String> threadLocal; public org.apache.solr.SolrLogFormatter(); public void setShorterFormat(); public java.lang.String format(java.util.logging.LogRecord); public void appendThread(java.lang.StringBuilder, java.util.logging.LogRecord); public java.lang.String _format(java.util.logging.LogRecord); public java.lang.String getHead(java.util.logging.Handler); public java.lang.String getTail(java.util.logging.Handler); public java.lang.String formatMessage(java.util.logging.LogRecord); public static void main(java.lang.String[]) throws java.lang.Exception; public static void go() throws java.lang.Exception; static {}; } On May 8, 2015, at 3:31 PM, Mark javam...@gmail.com wrote: I'm looking to use Solr to search over the byte code in classes and JARs. Does anyone know of, or have experience with, Analyzers, Tokenizers, and Token Filters for such a task? Regards Mark
Re: indexing java byte code in classes / jars
https://searchcode.com/ looks really interesting, however I want to crunch as many searchable aspects as possible out of jars sitting on a classpath or under a project structure... Really early days so I'm open to any suggestions On 8 May 2015 at 22:09, Mark javam...@gmail.com wrote: To answer "why bytecode": mostly because the use case I have is to index as much detail as possible from jars/classes: extract class names, method names/signatures, packages/imports. I am considering using ASM in order to generate an analysis view of the class. The sort of use cases I have would be method/signature searches. For example: 1) show any classes with a method named parse* 2) show any classes with a method named parse that takes a type *json* ...etc. In the past I have written something to reverse out javadocs from just java bytecode; using Solr would make this idea considerably more powerful. Thanks for the suggestions so far On 8 May 2015 at 21:19, Erik Hatcher erik.hatc...@gmail.com wrote: Oh, and sorry, I omitted a couple of details: # creating the “java” core/collection bin/solr create -c java # I ran this from my Solr source code checkout, so that SolrLogFormatter.class just happened to be handy Erik On May 8, 2015, at 4:11 PM, Erik Hatcher erik.hatc...@gmail.com wrote: What kinds of searches do you want to run? Are you trying to extract class names, method names, and such and make those searchable? If that’s the case, you need some kind of “parser” to reverse engineer that information from .class and .jar files before feeding it to Solr, which would happen before analysis. Java itself comes with a javap command that can do this; whether this is the “best” way to go for your scenario I don’t know, but here’s an interesting example pasted below (using Solr 5.x).
— Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class > test.txt bin/post -c java test.txt now search for coreInfoMap http://localhost:8983/solr/java/browse?q=coreInfoMap I tried to be cleverer and use the stdin option of bin/post, like this: javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class | bin/post -c java -url http://localhost:8983/solr/java/update/extract -type text/plain -params literal.id=SolrLogFormatter -out yes -d but something isn’t working right with the stdin detection like that (it does work to `cat test.txt | bin/post…` though, hmmm) test.txt looks like this, `cat test.txt`: Compiled from "SolrLogFormatter.java" public class org.apache.solr.SolrLogFormatter extends java.util.logging.Formatter { long startTime; long lastTime; java.util.Map<org.apache.solr.SolrLogFormatter$Method, java.lang.String> methodAlias; public boolean shorterFormat; java.util.Map<org.apache.solr.core.SolrCore, org.apache.solr.SolrLogFormatter$CoreInfo> coreInfoMap; public java.util.Map<java.lang.String, java.lang.String> classAliases; static java.lang.ThreadLocal<java.lang.String> threadLocal; public org.apache.solr.SolrLogFormatter(); public void setShorterFormat(); public java.lang.String format(java.util.logging.LogRecord); public void appendThread(java.lang.StringBuilder, java.util.logging.LogRecord); public java.lang.String _format(java.util.logging.LogRecord); public java.lang.String getHead(java.util.logging.Handler); public java.lang.String getTail(java.util.logging.Handler); public java.lang.String formatMessage(java.util.logging.LogRecord); public static void main(java.lang.String[]) throws java.lang.Exception; public static void go() throws java.lang.Exception; static {}; } On May 8, 2015, at 3:31 PM, Mark javam...@gmail.com wrote: I'm looking to use Solr to search over the byte code in classes and JARs. Does anyone know of, or have experience with, Analyzers, Tokenizers, and Token Filters for such a task? Regards Mark
Re: ZooKeeperException: Could not find configName for collection
Thank you Erick for your answer! I just tried to restart the first node and now the error is no longer there! Sorry for my too-early email :-) Bye! 2015-05-06 17:05 GMT+02:00 Erick Erickson erickerick...@gmail.com: Have you looked around at your directories on disk? I'm _not_ talking about the admin UI here. The default is core discovery mode, which recursively looks under solr_home and thinks there's a core wherever it finds a core.properties file. If you find such a thing, rename it or remove the directory. Another alternative would be to push a configset named new_core up to ZooKeeper; that might allow you to see (and then delete) the collection new_core belongs to. It looks like you tried to use the admin UI to create a core and it's all local, or something like that. Best, Erick On Wed, May 6, 2015 at 4:00 AM, shacky shack...@gmail.com wrote: Hi list. I created a new collection on my new SolrCloud installation. The new collection is shown and replicated on all three nodes, but on the first node (only on this one) I get this error: new_core: org.apache.solr.common.cloud.ZooKeeperException:org.apache.solr.common.cloud.ZooKeeperException: Could not find configName for collection new_core found:null I cannot see any core named new_core on that node, and I also tried to remove it: root@index1:/opt/solr# ./bin/solr delete -c new_core Connecting to ZooKeeper at zk1,zk2,zk3 ERROR: Collection new_core not found! Could you help me, please? Thank you very much! Bye
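For the record, pushing a configset up and linking it to a collection can be done with the zkcli script that ships with Solr (under server/scripts/cloud-scripts/ in Solr 5; paths and the ZooKeeper connect string below are placeholders):

  ./zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 -cmd upconfig -confdir /path/to/conf -confname new_core
  ./zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 -cmd linkconfig -collection new_core -confname new_core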
Re: solr.war built from solr 4.7.2 not working
On 5/7/2015 11:52 PM, Rahul Singh wrote: ERROR - 2015-05-08 11:15:25.738; org.apache.solr.common.SolrException; null:java.lang.IllegalArgumentException: You cannot set an index-time boost on an unindexed field, or one that omits norms This seems to be the problem. You are trying to set an index-time boost on a field whose definition disables indexing or omits norms. That isn't allowed, because if the field isn't indexed, you can't search on it (and therefore can't boost), and the index-time boost is stored in the field norm. Thanks, Shawn
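In schema.xml terms, an index-time boost only works on a field like the first definition below; sending a boost for a field like the second triggers exactly this exception (field names are made up for illustration):

  <!-- index-time boost OK: indexed, norms kept -->
  <field name="title" type="text_general" indexed="true" stored="true" omitNorms="false"/>

  <!-- index-time boost fails: not indexed (likewise with omitNorms="true") -->
  <field name="title_display" type="text_general" indexed="false" stored="true"/>

Either drop the boost from the update or fix the field definition; note that omitNorms defaults to true for primitive field types such as strings, numerics and dates.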
Re: SolrCloud indexing
I have just added a comment to the CWiki. Thanks again for your prompt answer Erick. Best, Vincenzo On Fri, May 8, 2015 at 12:39 AM, Erick Erickson erickerick...@gmail.com wrote: bq: ...forwards the index notation to itself and any replicas... That's just odd phrasing. All that means is that the document is sent through the indexing process on the leader and all followers for a shard and is indexed independently on each. This is as opposed to the old master/slave situation where the master indexed the doc, but the slave got the indexed version as part of a segment when it replicated. Could you add a comment to the CWiki calling the phrasing out? It really is a bit mysterious. Best, Erick On Thu, May 7, 2015 at 2:18 PM, Vincenzo D'Amore v.dam...@gmail.com wrote: Thanks Shawn. Just to make the picture clearer, I'm trying to understand why a 3-node SolrCloud cluster and an old-style Solr server take the same time to index the same documents. But in the wiki it is written: If the machine is a leader, SolrCloud determines which shard the document should go to, forwards the document to the leader for that shard, indexes the document for this shard, and *forwards the index notation to itself and any replicas*. https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud Could you please explain what "forwards the index notation" means? On the other hand, on SolrCloud I have 3 shards and 2 replicas for each shard. So every node is indexing all the documents, and this explains why SolrCloud takes the same time as an old-style Solr server. On Thu, May 7, 2015 at 3:08 PM, Shawn Heisey apa...@elyograg.org wrote: On 5/7/2015 3:04 AM, Vincenzo D'Amore wrote: Thanks Erick. I'm not sure I got your answer. I'll try to recap: when the raw document has to be indexed, it will be forwarded to the shard leader. The shard leader indexes the document for that shard, and then forwards the indexed document to any replicas. I just want to be sure that when the raw document is forwarded from the leader to the replicas it will be indexed only one time, on the shard leader. From what I understand replicas do not index; only the leader indexes. The document is indexed by all replicas. There is no way to forward the indexed document, it can only forward the source document ... so each replica must index it independently. The old-style master-slave replication (which existed long before SolrCloud) copies the finished Lucene segments, so only the master actually does indexing. SolrCloud doesn't have a master, only multiple replicas, one of which is elected leader, and replication only comes into the picture if there's a serious problem and Solr determines that it can't use the transaction log to recover the index. Thanks, Shawn -- Vincenzo D'Amore email: v.dam...@gmail.com skype: free.dev mobile: +39 349 8513251 -- Vincenzo D'Amore email: v.dam...@gmail.com skype: free.dev mobile: +39 349 8513251