mapreduce job using solrj 5
Hi, We recently started testing Solr 5. Our indexer creates a MapReduce job that uses SolrJ 5 to index documents to our SolrCloud. Until now we used Solr 4.10.3 with SolrJ 4.8.0. Our Hadoop dist is Cloudera 5. The problem is, SolrJ 5 uses httpclient-4.3.1 while Hadoop ships with httpclient-4.2.5, and that causes JAR hell for us because the Hadoop jars are loaded first and SolrJ uses the CloseableHttpClient class, which is in 4.3.1 but not in 4.2.5. Has anyone encountered this, and found a solution or a workaround? Right now we are replacing the jar physically on each data node. -- View this message in context: http://lucene.472066.n3.nabble.com/mapreduce-job-using-soirj-5-tp4212199.html Sent from the Solr - User mailing list archive at Nabble.com.
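[Editor's sketch, not from the thread: a common way around this sort of classpath conflict is to shade the newer httpclient into the job's uber-jar with the Maven shade plugin, relocating its packages so Hadoop's copy no longer collides. The plugin version and relocated package prefix below are illustrative, and this is untested against CDH:]

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <!-- rewrite httpclient packages (and SolrJ's references to them)
               inside the job jar, so Hadoop's httpclient-4.2.5 no longer shadows them -->
          <relocation>
            <pattern>org.apache.http</pattern>
            <shadedPattern>shaded.org.apache.http</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Hadoop's own httpclient then stays first on the cluster classpath, while SolrJ resolves the relocated 4.3.1 classes bundled in the job jar.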
How to create concatenated token
Hi, I have a requirement to create a concatenated token out of all the tokens emitted by the last item of my analyzer chain. Suppose my analyzer chain is:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateAll="1" splitOnNumerics="1" preserveOriginal="1"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
<filter class="solr.PorterStemFilterFactory"/>

I want to create a plugin that adds a concatenated token along with the last token. e.g. for "Solr training":

After Porter:  solr train            (positions 1 2)
Concatenated:  solr train solrtrain  (positions 1 2)

Please help me out: how do I create a custom filter for this requirement? With Regards Aman Tandon
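[Editor's sketch, not from the thread: a config-only alternative is to produce the single concatenated token in a separate field (populated via copyField) instead of a custom filter. The field type name below is made up. Note this concatenates the raw lowercased words, giving "solrtraining" rather than the stemmed "solrtrain", which is exactly why a custom filter at the end of the stemming chain may still be needed:]

```xml
<fieldType name="concat_text" class="solr.TextField">
  <analyzer>
    <!-- keep the whole value as one token, then strip whitespace -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="\s+" replacement="" replace="all"/>
  </analyzer>
</fieldType>
```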
Re: solr/lucene index merge and optimize performance improvement
Shenghua(Daniel) Wan wansheng...@gmail.com wrote: Actually, I am currently interested in how to boost merging/optimizing performance of single solr instance. We have the same challenge (we build static 900GB shards one at a time and the final optimization takes 8 hours with only 1 CPU core at 100%). I know that there is code for detecting SSDs, which should make merging faster (by running more merges in parallel?), but I am afraid that optimize (a single merge) is always single threaded. It seems to me that at least some of the different files making up a segment could be created in parallel, but I do not know how hard it would be to do so. - Toke Eskildsen
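[Editor's sketch, not from the thread: the number of concurrent merge threads is configurable in solrconfig.xml via the merge scheduler. The values below are illustrative; as Toke notes, this parallelises concurrent background merges, not the single large merge of an optimize:]

```xml
<indexConfig>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <!-- allow several merges to run at once on fast storage -->
    <int name="maxThreadCount">4</int>
    <int name="maxMergeCount">6</int>
  </mergeScheduler>
</indexConfig>
```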
Re: How to create concatenated token
Can I ask why you need to concatenate the tokens? Maybe we can find a better solution than concatenating all the tokens into one single big token. I find it difficult to understand the reasons behind tokenising, token filtering and then un-tokenizing again :) It would be great if you explained a little better what you would like to do! Cheers 2015-06-16 13:26 GMT+01:00 Aman Tandon amantandon...@gmail.com: [snip] -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Highlight in Velocity UI on Google Chrome
Hi, I was testing the highlight feature and played with the techproducts example. It appears that the highlighting works on Mozilla Firefox, but not on Google Chrome. For your information Benjamin
Re: Do we need to add docValues=true to _version_ field in schema.xml?
Did you look in the example schema files? None of them have _version_ set as docValues. Best, Erick On Tue, Jun 16, 2015 at 1:44 AM, forest_soup tanglin0...@gmail.com wrote: For the _version_ field in the schema.xml, do we need to set it to docValues=true? <field name="_version_" type="long" indexed="true" stored="true"/> As we noticed there is a FieldCache entry for _version_ in the Solr stats: http://lucene.472066.n3.nabble.com/file/n4212123/IMAGE%245A8381797719FDA9.jpg -- View this message in context: http://lucene.472066.n3.nabble.com/Do-we-need-to-add-docValues-true-to-version-field-in-schema-xml-tp4212123.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to create concatenated token
Typo in my previous mail; the corrected example is: e.g. intent for "solr training": fq=id:(234 456 545) title:(solr training) With Regards Aman Tandon On Tue, Jun 16, 2015 at 8:13 PM, Aman Tandon amantandon...@gmail.com wrote: [snip]
Re: Solr's suggester results
The suggesters are built to return whole fields. You _might_ be able to add multiple fragments to a multiValued entry and get fragments; I haven't tried that, though, and I suspect you'd actually get the same thing. This is an XY problem IMO. Please describe exactly what you're trying to accomplish, with examples, rather than continue to pursue this path. It sounds like you want spellcheck or similar. The _point_ of the suggesters is that they handle multiple-word suggestions by returning the whole field. So putting long text fields into them is not going to work. Best, Erick On Tue, Jun 16, 2015 at 1:46 AM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: Inline: 2015-06-16 4:43 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: Thanks Benedetti, I've changed to the AnalyzingInfixLookup approach, and it is able to start searching from the middle of the field. However, is it possible to make the suggester show only part of the content of the field (like 2 or 3 fields after), instead of the entire content/sentence, which can be quite long? I assume you use fields in the place of tokens. The answer is yes; I already said that in my previous mail. I invite you to read the answers and the linked documentation carefully! Regarding the excessive dimensions of tokens: this is weird. What are you trying to autocomplete? I really doubt it would be useful for a user to see super-long autocompleted terms. Cheers Regards, Edwin On 15 June 2015 at 17:33, Alessandro Benedetti benedetti.ale...@gmail.com wrote: ehehe Edwin, I think you should read again the document I linked a while ago: http://lucidworks.com/blog/solr-suggester/ The suggester you used is not meant to provide infix suggestions. The fuzzy suggester works on a fuzzy basis with the *starting* terms of a field's content. What you are looking for is actually one of the infix suggesters, for example the AnalyzingInfixLookup approach.
When working with suggesters it is important first to make a distinction: 1) returning the full content of the field (AnalyzingInfix or Fuzzy) 2) returning token(s) (FreeTextSuggester). Then the second difference is: 1) infix suggestions (from the middle of the field content) 2) classic suggestions (from the beginning of the field content). With that clarified, it will be quite simple to work with suggesters. Cheers 2015-06-15 9:28 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: I've indexed a rich-text document with the following content: "This is a testing rich text documents to test the uploading of files to Solr" When I tried to use the suggester, it returned me the entire field content once I entered suggest?q=t. However, when I tried to search for q='rich', I don't get any results returned. This is my current configuration for the suggester:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">Suggestion</str>
    <str name="suggestAnalyzerFieldType">suggestType</str>
    <str name="buildOnStartup">true</str>
    <str name="buildOnCommit">false</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="wt">json</str>
    <str name="indent">true</str>
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
    <str name="suggest.dictionary">mySuggester</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

Is it possible to allow the suggester to return something even from the middle of the sentence, and also not to return the entire sentence if the sentence is too long? Perhaps it should just suggest the next 2 or 3 fields, and return more fields as the user types. For example: when the user types 'this', it should return 'This is a testing'; when the user types 'this is a testing', it should return 'This is a testing rich text documents'.
Regards, Edwin -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
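[Editor's sketch, not from the thread: based on the configuration quoted above, switching to infix matching is essentially a one-line change of the lookup implementation in the suggester definition:]

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <!-- AnalyzingInfixLookupFactory matches from the middle of the field,
         unlike FuzzyLookupFactory which only matches the starting terms -->
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">Suggestion</str>
    <str name="suggestAnalyzerFieldType">suggestType</str>
    <str name="buildOnStartup">true</str>
  </lst>
</searchComponent>
```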
Re: phrase matches returning near matches
yep, seems that's the answer. The highlighting is done separately by the Rails app, so I'll look into proper Solr highlighting. Thanks a lot for the use of your ears, much improved understanding! cheers, Alistair -- mov eax,1 mov ebx,0 int 80h On 16/06/2015 16:33, Erick Erickson erickerick...@gmail.com wrote: [snip]
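[Editor's sketch, not from the thread: a toy illustration in plain Python, not Solr code, of the position check described above. A phrase query without slop matches when the stemmed terms occupy adjacent positions, which is why "management changes" in the document can match the query "manage change". The hard-coded stem table stands in for Porter stemming:]

```python
# Toy model of a slop-0 phrase query over stemmed, positioned tokens.
STEMS = {
    "manage": "manag", "management": "manag",
    "change": "chang", "changes": "chang",
}

def analyze(text):
    """Return (stem, position) pairs, mimicking an analysis chain with stemming."""
    return [(STEMS.get(w.lower(), w.lower()), i)
            for i, w in enumerate(text.split())]

def phrase_match(doc, phrase):
    """True if the stemmed phrase terms occur at adjacent positions (no slop)."""
    doc_tokens = analyze(doc)
    stems = [s for s, _ in analyze(phrase)]
    positions = {i for s, i in doc_tokens if s == stems[0]}
    for stem in stems[1:]:
        nxt = {i for s, i in doc_tokens if s == stem}
        positions = {i + 1 for i in positions} & nxt
    return bool(positions)

# Adjacent stems match even though the surface forms differ:
print(phrase_match("land management changes", "manage change"))     # True
# Non-adjacent stems do not match a slop-0 phrase:
print(phrase_match("management of many changes", "manage change"))  # False
```

This is the same check the admin/analysis page lets you do by hand: line up the token positions after analysis and see whether the phrase terms end up adjacent.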
Re: phrase matches returning near matches
I agree with Alessandro: the behavior you're describing is _not_ correct at all given your description. So either: 1) There's something interesting about your configuration that doesn't seem important that you haven't told us, although what it could be is a mystery to me too ;) 2) It's matching on something else. Note that the phrase has been stemmed, so something in there besides "management" might stem to "manag" and/or something other than "changes" might stem to "chang", and the two of _them_ happen to be next to each other. "are managers changing?" for instance. Or even something less likely. Perhaps turn on highlighting and see if it pops out? 3) You've uncovered a bug, although I suspect others would have reported it and the unit tests would have barfed all over the place. One other thing you can do: go to the admin/analysis page and turn on the verbose check box. Put "management is undergoing many changes" in both the query and index boxes. The result (it's kind of hard to read, I'll admit) will include the position of each token after all the analysis is done. Phrase queries (without slop) should only be matching adjacent positions, so the question is whether the position info looks correct. Best, Erick On Tue, Jun 16, 2015 at 4:40 AM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: According to your debug you are using the default Lucene query parser. This surprises me, as I would expect with that query a match with distance 0 between the 2 terms. Are you sure nothing else in that field matches the phrase query? From the documentation: "Lucene supports finding words that are within a specific distance away. To do a proximity search use the tilde, ~, symbol at the end of a Phrase. For example to search for apache and jakarta within 10 words of each other in a document use the search: jakarta apache~10" Cheers 2015-06-16 11:33 GMT+01:00 Alistair Young alistair.yo...@uhi.ac.uk: it's a useful behaviour. I'd just like to understand where it's deciding the document is relevant. debug output is:

<lst name="debug">
  <str name="rawquerystring">dc.description:"manage change"</str>
  <str name="querystring">dc.description:"manage change"</str>
  <str name="parsedquery">PhraseQuery(dc.description:"manag chang")</str>
  <str name="parsedquery_toString">dc.description:"manag chang"</str>
  <lst name="explain">
    <str name="tst:test">
1.2008798 = (MATCH) weight(dc.description:"manag chang" in 221) [DefaultSimilarity], result of:
  1.2008798 = fieldWeight in 221, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = phraseFreq=1.0
    9.6070385 = idf(), sum of:
      4.0365543 = idf(docFreq=101, maxDocs=2125)
      5.5704846 = idf(docFreq=21, maxDocs=2125)
    0.125 = fieldNorm(doc=221)
    </str>
  </lst>
  <str name="QParser">LuceneQParser</str>
  <lst name="timing">
    <double name="time">41.0</double>
    <lst name="prepare">
      <double name="time">3.0</double>
      <lst name="query"><double name="time">0.0</double></lst>
      <lst name="facet"><double name="time">0.0</double></lst>
      <lst name="mlt"><double name="time">0.0</double></lst>
      <lst name="highlight"><double name="time">0.0</double></lst>
      <lst name="stats"><double name="time">0.0</double></lst>
      <lst name="debug"><double name="time">0.0</double></lst>
    </lst>
    <lst name="process">
      <double name="time">35.0</double>
      <lst name="query"><double name="time">0.0</double></lst>
      <lst name="facet"><double name="time">0.0</double></lst>
      <lst name="mlt"><double name="time">0.0</double></lst>
      <lst name="highlight"><double name="time">0.0</double></lst>
      <lst name="stats"><double name="time">0.0</double></lst>
      <lst name="debug"><double name="time">35.0</double></lst>
    </lst>
  </lst>
</lst>

thanks, Alistair -- mov eax,1 mov ebx,0 int 80h On 16/06/2015 11:26, Alessandro Benedetti benedetti.ale...@gmail.com wrote: Can you show us how the query is parsed? You didn't tell us anything about the query parser you are using. Enabling debugQuery=true will show you how the query is parsed, and this will be quite useful for us. Cheers 2015-06-16 11:22 GMT+01:00 Alistair Young alistair.yo...@uhi.ac.uk: Hiya, I've been looking for documentation that would point to where I could modify or explain why 'near neighbours' are returned from a phrase search. If I search for "manage change" I get back a document that contains "this will help in your management of ...lots more words... changes". It's relevant, but I'd like to understand why Solr is returning it. Is it a combination of fuzzy/slop? The distance between the two variations of the two words in the document is quite large. thanks, Alistair -- mov eax,1 mov ebx,0 int 80h -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti
Re: phrase matches returning near matches
Hmmm. First, highlighting should work here, if you have it configured to work on the dc.description field. As to whether the phrase "management changes" is near enough, I pretty much guarantee it is. This is where the admin/analysis page can answer this type of question authoritatively, since it's based exactly on your particular analysis chain. Best, Erick On Tue, Jun 16, 2015 at 8:25 AM, Alistair Young alistair.yo...@uhi.ac.uk wrote: [snip]
Re: mapreduce job using solrj 5
Sounds like a question better asked in one of the Cloudera support forums, 'cause all I can do is guess ;). I suppose, theoretically, that you could check out the Solr 5 code, substitute httpclient-4.2.5.jar in the build system, recompile and go, but that's totally a guess based on zero knowledge of whether compiling Solr with an earlier httpclient would even work. Frankly, though, that sounds like more work than distributing the older jar to the data nodes. Best, Erick On Tue, Jun 16, 2015 at 7:23 AM, adfel70 adfe...@gmail.com wrote: [snip]
Re: phrase matches returning near matches
yes, prolly not a bug. The highlighting is on but nothing is highlighted. Perhaps this text is triggering it? 'consider the impacts of land management changes' That would seem reasonable. It's not a direct match, so no highlighting (the highlighting does work on a direct match), but 'management changes' must be near enough 'manage change' to trigger a result. Alistair -- mov eax,1 mov ebx,0 int 80h On 16/06/2015 16:18, Erick Erickson erickerick...@gmail.com wrote: [snip]
TikaEntityProcessor Not Finding My Files
Hi, there's a guy who's already asked a question similar to this, and I'm basically going off what he did. It's exactly what I'm doing: taking a file path from a database and using TikaEntityProcessor to analyze the document. The link to his question is here: http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-td856965.html#a3524905 His problem was version issues with Tika, but I'm using a version that is about five years older, so I'm not sure if it's still an issue with the current version of Tika or if I'm missing something extremely obvious (which is possible, I'm extremely new to Solr). This is my data configuration (TextContentURL is the file path!):

<dataConfig>
  <dataSource name="ds-db" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/EDMS_Metadata" user="root" password="**"/>
  <dataSource name="ds-file" type="BinFileDataSource"/>
  <document name="doc1">
    <entity name="db-data" dataSource="ds-db"
            query="select TextContentURL as 'id',ID,Title,AuthorCreator from MasterIndex">
      <field column="TextContentURL" name="id"/>
      <field column="Title" name="title"/>
    </entity>
    <entity name="file" dataSource="ds-file" processor="TikaEntityProcessor"
            url="${db-data.TextContentURL}" format="text">
      <field column="text" name="text"/>
    </entity>
  </document>
</dataConfig>

I'd like to note that when I delete the second entity and just run the database draw, it works fine. I can run a query and I get this output when I run a faceted search:

{
  "numFound": 283,
  "start": 0,
  "docs": [
    {
      "id": "/home/paden/Documents/LWP_Files/BIGDATA/6220106.pdf",
      "title": "ENGINEERING INITIATION"
    },

This means it is pulling the document file path JUST FINE; the id is the correct file path. But when I re-add the second entity, it logs errors saying it can't find the file. Am I missing something obvious?
-- View this message in context: http://lucene.472066.n3.nabble.com/TikaEntityProcessor-Not-Finding-My-Files-tp4212241.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to create concatenated token
We have some business logic to search the user query against user intent or to find the exact matching products. e.g. Intent for solr training: fq=id: 234, 456, 545 title(solr training) As we can see it is a phrase query, so it will take more time than a single stemmed-token query. There are also 5-7 word phrase queries. So we want to reduce the search time by implementing this feature. With Regards Aman Tandon On Tue, Jun 16, 2015 at 6:42 PM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: Can I ask you why you need to concatenate the tokens? Maybe we can find a better solution to concat all the tokens into one single big token. I find it difficult to understand the reasons behind tokenising, token filtering and then un-tokenizing again :) It would be great if you could explain a little bit better what you would like to do! Cheers 2015-06-16 13:26 GMT+01:00 Aman Tandon amantandon...@gmail.com: Hi, I have a requirement to create a concatenated token of all the tokens created from the last item of my analyzer chain. *Suppose my analyzer chain is :* <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.WordDelimiterFilterFactory" catenateAll="1" splitOnNumerics="1" preserveOriginal="1"/> <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/> <filter class="solr.PorterStemmerFilterFactory"/> I want to create a concatenated-token plugin to add a concatenated token along with the last token. e.g. Solr training *Porter:-* solr train Position 1 2 *Concatenated :-* solr train solrtrain Position 1 2 Please help me out. How do I create a custom filter for this requirement? With Regards Aman Tandon -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: mapreduce job using soirj 5
On 6/16/2015 9:24 AM, Erick Erickson wrote: Sounds like a question better asked in one of the Cloudera support forums, 'cause all I can do is guess ;). I suppose, theoretically, that you could check out the Solr5 code and substitute the httpclient-4.2.5.jar in the build system, recompile and go, but that's totally a guess based on zero knowledge of whether compiling Solr with an earlier httpclient would even work. Frankly, though, that sounds like more work than distributing the older jar to the data nodes. Best, Erick On Tue, Jun 16, 2015 at 7:23 AM, adfel70 adfe...@gmail.com wrote: Hi, We recently started testing solr 5, our indexer creates a mapreduce job that uses solrj5 to index documents to our SolrCloud. Until now, we used solr 4.10.3 with solrj 4.8.0. Our hadoop dist is cloudera 5. The problem is, solrj5 is using httpclient-4.3.1 while hadoop is installed with httpclient-4.2.5 In addition to what Erick said: When I upgraded the build system in Solr from HttpClient 4.2 to 4.3, no code changes were required. It worked immediately, and all tests passed. It is likely that you can simply use HttpClient 4.3.1 everywhere and hadoop will work properly. This is one of Apache's design goals for software libraries. It's not always possible to achieve it, but it is something we always try to do. Thanks, Shawn
Re: solr/lucene index merge and optimize performance improvement
Hi, Toke, Did you try MapReduce with solr? I think it should be a good fit for your use case. On Tue, Jun 16, 2015 at 5:02 AM, Toke Eskildsen t...@statsbiblioteket.dk wrote: Shenghua(Daniel) Wan wansheng...@gmail.com wrote: Actually, I am currently interested in how to boost merging/optimizing performance of single solr instance. We have the same challenge (we build static 900GB shards one at a time and the final optimization takes 8 hours with only 1 CPU core at 100%). I know that there is code for detecting SSDs, which should make merging faster (by running more merges in parallel?), but I am afraid that optimize (a single merge) is always single threaded. It seems to me that at least some of the different files making up a segment could be created in parallel, but I do not know how hard it would be to do so. - Toke Eskildsen -- Regards, Shenghua (Daniel) Wan
Re: phrase matches returning near matches
This might be an issue with your stemmer: if management is stemmed to manage and changes is stemmed to change, then the terms match. You can use the Solr admin UI to test your indexing and query analysis chains to see if this is happening. On 6/16/2015 3:22 AM, Alistair Young wrote: Hiya, I've been looking for documentation that would point to where I could modify or explain why 'near neighbours' are returned from a phrase search. If I search for: manage change I get back a document that contains this will help in your management of lots more words... changes. It's relevant, but I'd like to understand why solr is returning it. Is it a combination of fuzzy/slop? The distance between the two variations of the two words in the document is quite large. thanks, Alistair -- mov eax,1 mov ebx,0 int 80h
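If exact phrase behaviour is wanted alongside stemmed search, a common pattern is to copyField the text into an unstemmed companion field and query that field for phrases. A minimal schema sketch, where the field and type names are made up for illustration and not taken from Alistair's actual schema:

```xml
<!-- Hypothetical companion field: same tokenization, no stemming,
     so "manage change" will not match "management ... changes". -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="content_exact" type="text_exact" indexed="true" stored="false"/>
<copyField source="content" dest="content_exact"/>
```

Phrase queries can then be sent against content_exact while keeping the stemmed field for recall.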
Re: TikaEntityProcessor Not Finding My Files
I thought it might be useful to list the logging errors as well. Here they are. There are just three. WARN FileDataSource FileDataSource.basePath is empty. Resolving to: /home/paden/Downloads/solr-5.1.0/server/. ERROR DocBuilder Exception while processing: file document : SolrInputDocument(fields: []): org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.RuntimeException: java.io.FileNotFoundException: Could not find file: (resolved to: /home/paden/Downloads/solr-5.1.0/server/. ERROR DataImporter Full Import failed: java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.RuntimeException: java.io.FileNotFoundException: Could not find file: (resolved to: /home/paden/Downloads/solr-5.1.0/server/. -- View this message in context: http://lucene.472066.n3.nabble.com/TikaEntityProcessor-Not-Finding-My-Files-tp4212241p4212252.html Sent from the Solr - User mailing list archive at Nabble.com.
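For what it's worth, "Could not find file: (resolved to: .../server/." together with the empty basePath warning usually means `${db-data.TextContentURL}` resolved to an empty string, which happens when the Tika entity is a sibling of the JDBC entity rather than nested inside it. A sketch of the nested form, untested and based only on the config posted earlier in this thread:

```xml
<!-- Sketch: nest the Tika entity inside the JDBC entity so that
     ${db-data.TextContentURL} is resolved once per database row.
     Data source definitions are unchanged from the original config. -->
<document name="doc1">
  <entity name="db-data" dataSource="ds-db"
          query="select TextContentURL as 'id',ID,Title,AuthorCreator from MasterIndex">
    <field column="TextContentURL" name="id"/>
    <field column="Title" name="title"/>
    <entity name="file" dataSource="ds-file" processor="TikaEntityProcessor"
            url="${db-data.TextContentURL}" format="text">
      <field column="text" name="text"/>
    </entity>
  </entity>
</document>
```

With the entities as siblings, the file entity runs on its own with no row in scope, so BinFileDataSource falls back to basePath (the server directory).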
Re: Do we need to add docValues=true to _version_ field in schema.xml?
: For the _version_ field in the schema.xml, do we need to set it be : docValues=true? you *can* add docValues, but it is not required. There is an open discussion about whether we should add docValues to the _version_ field (or even switch completely to indexed=false) in this jira... https://issues.apache.org/jira/browse/SOLR-6337 ...if you try it out and find it works better for you, please post a comment with your experiences and any anecdotal performance impacts you notice. (real world use cases/observations are always helpful) -Hoss http://www.lucidworks.com/
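For reference, the stock schema declares _version_ roughly as below; if you do experiment per SOLR-6337, the change is a single attribute. This is a sketch, so check the type name against your own schema:

```xml
<!-- Optional docValues on _version_ (not required; see SOLR-6337): -->
<field name="_version_" type="long" indexed="true" stored="true" docValues="true"/>
```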
Re: mapreduce job using soirj 5
Hadoop has a switch that lets you use your jar rather than the one Hadoop carries. Google for HADOOP_OPTS. Good luck. On Tue, Jun 16, 2015 at 7:23 AM, adfel70 adfe...@gmail.com wrote: Hi, We recently started testing solr 5, our indexer creates a mapreduce job that uses solrj5 to index documents to our SolrCloud. Until now, we used solr 4.10.3 with solrj 4.8.0. Our hadoop dist is cloudera 5. The problem is, solrj5 is using httpclient-4.3.1 while hadoop is installed with httpclient-4.2.5 and that is causing us jar-hell because hadoop jars are being loaded first and solrj is using the CloseableHttpClient class which is in 4.3.1 but not in 4.2.5 Does anyone encounter this? And have a solution? Or a workaround? Right now we are replacing the jar physically in each data node -- View this message in context: http://lucene.472066.n3.nabble.com/mapreduce-job-using-soirj-5-tp4212199.html Sent from the Solr - User mailing list archive at Nabble.com. -- Regards, Shenghua (Daniel) Wan
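Besides HADOOP_OPTS, Hadoop 2.x (and therefore CDH5) has a job-level property that asks MapReduce to put the job's own jars ahead of Hadoop's on the task classpath. This is a sketch; verify the exact property name against your Hadoop version's documentation:

```xml
<!-- mapred-site.xml or per-job configuration: prefer the job's jars
     (e.g. httpclient-4.3.1 shipped with solrj 5) over Hadoop's copies. -->
<property>
  <name>mapreduce.job.user.classpath.first</name>
  <value>true</value>
</property>
```

Set this way, the tasks should load the newer HttpClient without replacing jars on each data node.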
Re: Facet on same field in different ways
: Have you tried this syntax ? : : facet=true&facet.field={!ex=st key=terms facet.limit=5 : facet.prefix=ap}query_terms&facet.field={!key=terms2 : facet.limit=1}query_terms&rows=0&facet.mincount=1 : : This seems the proper syntax, I found it here : yeah, local params are supported for specifying facet options like this. Apparently it never got documented, but i've added a comment to the Faceting page with a techproducts example anyone can try with solr out of the box... https://cwiki.apache.org/confluence/display/solr/Faceting?focusedCommentId=58851733#comment-58851733 -Hoss http://www.lucidworks.com/
Re: Highlight in Velocity UI on Google Chrome
I think it makes it bold on bold, which won't be particularly visible. On Tue, Jun 16, 2015, at 06:52 AM, Sznajder ForMailingList wrote: Hi, I was testing the highlight feature and played with the techproducts example. It appears that the highlighting works on Mozilla Firefox, but not on Google Chrome. For your information Benjamin
Re: Facet on same field in different ways
Thanks guys. The syntax facet.field={!key=abc facet.limit=10}facetFieldName works. On Tue, Jun 16, 2015 at 11:22 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : Have you tried this syntax ? : : facet=true&facet.field={!ex=st key=terms facet.limit=5 : facet.prefix=ap}query_terms&facet.field={!key=terms2 : facet.limit=1}query_terms&rows=0&facet.mincount=1 : : This seems the proper syntax, I found it here : yeah, local params are supported for specifying facet options like this. Apparently it never got documented, but i've added a comment to the Faceting page with a techproducts example anyone can try with solr out of the box... https://cwiki.apache.org/confluence/display/solr/Faceting?focusedCommentId=58851733#comment-58851733 -Hoss http://www.lucidworks.com/
Re: How to create concatenated token
Hi, Any guesses, how could I achieve this behaviour. With Regards Aman Tandon On Tue, Jun 16, 2015 at 8:15 PM, Aman Tandon amantandon...@gmail.com wrote: e.g. Intent for solr training: fq=id: 234, 456, 545 title(solr training) typo error e.g. Intent for solr training: fq=id:(234 456 545) title:(solr training) With Regards Aman Tandon On Tue, Jun 16, 2015 at 8:13 PM, Aman Tandon amantandon...@gmail.com wrote: We has some business logic to search the user query in user intent or finding the exact matching products. e.g. Intent for solr training: fq=id: 234, 456, 545 title(solr training) As we can see it is phrase query so it will took more time than the single stemmed token query. There are also 5-7 words phrase query. So we want to reduce the search time by implementing this feature. With Regards Aman Tandon On Tue, Jun 16, 2015 at 6:42 PM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: Can I ask you why you need to concatenate the tokens ? Maybe we can find a better solution to concat all the tokens in one single big token . I find it difficult to understand the reasons behind tokenising, token filtering and then un-tokenizing again :) It would be great if you explain a little bit better what you would like to do ! Cheers 2015-06-16 13:26 GMT+01:00 Aman Tandon amantandon...@gmail.com: Hi, I have a requirement to create the concatenated token of all the tokens created from the last item of my analyzer chain. *Suppose my analyzer chain is :* * tokenizer class=solr.WhitespaceTokenizerFactory / filter class=solr.WordDelimiterFilterFactory catenateAll=1 splitOnNumerics=1 preserveOriginal=1/filter class=solr.EdgeNGramFilterFactory minGramSize=2 maxGramSize=15 side=front /filter class=solr.PorterStemmerFilterFactory/* I want to create a concatenated token plugin to add at concatenated token along with the last token. e.g. Solr training *Porter:-* solr train Position 1 2 *Concatenated :-* solr train solrtrain Position 1 2 Please help me out. 
How to create custom filter for this requirement. With Regards Aman Tandon -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Solr's suggester results
The long content is from when I tried to index PDF files. As some PDF files have a lot of words in the content, it will lead to the *UTF8 encoding is longer than the max length 32766 error.* I think the problem is that the content size of the PDF file exceeds 32766 characters? I'm trying to be able to index documents that can be of any size (even those with very large contents), and build the suggester from there. Also, when I do a search, it shouldn't be returning whole fields, but just a portion of the sentence. Regards, Edwin On 16 June 2015 at 23:02, Erick Erickson erickerick...@gmail.com wrote: The suggesters are built to return whole fields. You _might_ be able to add multiple fragments to a multiValued entry and get fragments, I haven't tried that though and I suspect that actually you'd get the same thing. This is an XY problem IMO. Please describe exactly what you're trying to accomplish, with examples, rather than continue to pursue this path. It sounds like you want spellcheck or similar. The _point_ behind the suggesters is that they handle multiple-word suggestions by returning the whole field. So putting long text fields into them is not going to work. Best, Erick On Tue, Jun 16, 2015 at 1:46 AM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: in line : 2015-06-16 4:43 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: Thanks Benedetti, I've changed to the AnalyzingInfixLookup approach, and it is able to start searching from the middle of the field. However, is it possible to make the suggester show only part of the content of the field (like 2 or 3 words after), instead of the entire content/sentence, which can be quite long? I assume you use fields in the place of tokens. The answer is yes, I already said that in my previous mail, I invite you to read carefully the answers and the documentation linked ! Related to the excessive dimensions of tokens: this is weird, what are you trying to autocomplete ?
I really doubt it would be useful for a user to see super-long autocompleted terms. Cheers Regards, Edwin On 15 June 2015 at 17:33, Alessandro Benedetti benedetti.ale...@gmail.com wrote: ehehe Edwin, I think you should read again the document I linked a while ago : http://lucidworks.com/blog/solr-suggester/ The suggester you used is not meant to provide infix suggestions. The fuzzy suggester works on a fuzzy basis, with the *starting* terms of a field content. What you are looking for is actually one of the Infix Suggesters. For example the AnalyzingInfixLookup approach. When working with Suggesters it is important first to make a distinction : 1) Returning the full content of the field ( analysisInfix or Fuzzy) 2) Returning token(s) ( Free Text Suggester) Then the second difference is : 1) Infix suggestions ( from the middle of the field content) 2) Classic suggester ( from the beginning of the field content) Clarified that, it will be quite simple to work with suggesters. Cheers 2015-06-15 9:28 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: I've indexed a rich-text document with the following content: This is a testing rich text documents to test the uploading of files to Solr When I tried to use the suggestion, it returns me the entire field in the content once I enter suggest?q=t. However, when I tried to search for q='rich', I don't get any results returned.
This is my current configuration for the suggester: <searchComponent name="suggest" class="solr.SuggestComponent"> <lst name="suggester"> <str name="name">mySuggester</str> <str name="lookupImpl">FuzzyLookupFactory</str> <str name="dictionaryImpl">DocumentDictionaryFactory</str> <str name="field">Suggestion</str> <str name="suggestAnalyzerFieldType">suggestType</str> <str name="buildOnStartup">true</str> <str name="buildOnCommit">false</str> </lst> </searchComponent> <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy"> <lst name="defaults"> <str name="wt">json</str> <str name="indent">true</str> <str name="suggest">true</str> <str name="suggest.count">10</str> <str name="suggest.dictionary">mySuggester</str> </lst> <arr name="components"> <str>suggest</str> </arr> </requestHandler> Is it possible to allow the suggester to return something even from the middle of the sentence, and also not to return the entire sentence if the sentence is long? Perhaps it should just suggest the next 2 or 3 words, and return more words as the user types. For example, when the user types 'this', it should return 'This is a testing'. When the user types 'this is a testing', it should return 'This is a testing rich text documents'. Regards, Edwin -- -- Benedetti
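Following the advice earlier in the thread, switching the posted config from FuzzyLookupFactory to an infix lookup is a one-line change. A sketch (untested, field and analyzer names kept from the config above):

```xml
<!-- Same suggester, but with an infix lookup so matches can start
     mid-field (e.g. q=rich matching "...rich text documents..."). -->
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">Suggestion</str>
    <str name="suggestAnalyzerFieldType">suggestType</str>
    <str name="buildOnStartup">true</str>
    <str name="buildOnCommit">false</str>
  </lst>
</searchComponent>
```

Note that this still returns the whole field content per suggestion; returning fragments is a different problem, as Erick points out below.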
Re: Solr's suggester results
Have you looked at spellchecker? Because that sounds much more like what you're asking about than suggester. Spell checking is more what you're asking for; have you even looked at that after it was suggested? bq: Also, when I do a search, it shouldn't be returning whole fields, but just to return a portion of the sentence This is what highlighting is built for. Really, I recommend you take the time to do some familiarization with the whole search space and Solr. The excellent book here: http://www.amazon.com/Solr-Action-Trey-Grainger/dp/1617291021/ref=sr_1_1?ie=UTF8&qid=1434513284&sr=8-1&keywords=apache+solr&pebp=1434513287267&perid=0YRK508J0HJ1N3BAX20E will give you the grounding you need to get the most out of Solr. Best, Erick On Tue, Jun 16, 2015 at 8:27 PM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: The long content is from when I tried to index PDF files. As some PDF files have a lot of words in the content, it will lead to the *UTF8 encoding is longer than the max length 32766 error.* I think the problem is that the content size of the PDF file exceeds 32766 characters? I'm trying to be able to index documents that can be of any size (even those with very large contents), and build the suggester from there. Also, when I do a search, it shouldn't be returning whole fields, but just a portion of the sentence. Regards, Edwin On 16 June 2015 at 23:02, Erick Erickson erickerick...@gmail.com wrote: The suggesters are built to return whole fields. You _might_ be able to add multiple fragments to a multiValued entry and get fragments, I haven't tried that though and I suspect that actually you'd get the same thing. This is an XY problem IMO. Please describe exactly what you're trying to accomplish, with examples, rather than continue to pursue this path. It sounds like you want spellcheck or similar. The _point_ behind the suggesters is that they handle multiple-word suggestions by returning the whole field.
So putting long text fields into them is not going to work. Best, Erick On Tue, Jun 16, 2015 at 1:46 AM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: in line : 2015-06-16 4:43 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: Thanks Benedetti, I've change to the AnalyzingInfixLookup approach, and it is able to start searching from the middle of the field. However, is it possible to make the suggester to show only part of the content of the field (like 2 or 3 fields after), instead of the entire content/sentence, which can be quite long? I assume you use fields in the place of tokens. The answer is yes, I already said that in my previous mail, I invite you to read carefully the answers and the documentation linked ! Related the excessive dimensions of tokens. This is weird, what are you trying to autocomplete ? I really doubt would be useful for a user to see super long auto completed terms. Cheers Regards, Edwin On 15 June 2015 at 17:33, Alessandro Benedetti benedetti.ale...@gmail.com wrote: ehehe Edwin, I think you should read again the document I linked time ago : http://lucidworks.com/blog/solr-suggester/ The suggester you used is not meant to provide infix suggestions. The fuzzy suggester is working on a fuzzy basis , with the *starting* terms of a field content. What you are looking for is actually one of the Infix Suggesters. For example the AnalyzingInfixLookup approach. When working with Suggesters is important first to make a distinction : 1) Returning the full content of the field ( analysisInfix or Fuzzy) 2) Returning token(s) ( Free Text Suggester) Then the second difference is : 1) Infix suggestions ( from the middle of the field content) 2) Classic suggester ( from the beginning of the field content) Clarified that, will be quite simple to work with suggesters. 
Cheers 2015-06-15 9:28 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: I've indexed a rich-text documents with the following content: This is a testing rich text documents to test the uploading of files to Solr When I tried to use the suggestion, it return me the entire field in the content once I enter suggest?q=t. However, when I tried to search for q='rich', I don't get any results returned. This is my current configuration for the suggester: searchComponent name=suggest class=solr.SuggestComponent lst name=suggester str name=namemySuggester/str str name=lookupImplFuzzyLookupFactory/str str name=dictionaryImplDocumentDictionaryFactory/str str name=fieldSuggestion/str str name=suggestAnalyzerFieldTypesuggestType/str str name=buildOnStartuptrue/str str name=buildOnCommitfalse/str /lst /searchComponent requestHandler name=/suggest class=solr.SearchHandler startup=lazy lst
Re: Solr's suggester results
Yes, I've looked at that before, but I was told that the newer version of Solr has its own suggester and does not need to use spellchecker anymore? So it's not necessary to use the spellchecker inside the suggester anymore? Regards, Edwin On 17 June 2015 at 11:56, Erick Erickson erickerick...@gmail.com wrote: Have you looked at spellchecker? Because that sounds much more like what you're asking about than suggester. Spell checking is more what you're asking for; have you even looked at that after it was suggested? bq: Also, when I do a search, it shouldn't be returning whole fields, but just to return a portion of the sentence This is what highlighting is built for. Really, I recommend you take the time to do some familiarization with the whole search space and Solr. The excellent book here: http://www.amazon.com/Solr-Action-Trey-Grainger/dp/1617291021/ref=sr_1_1?ie=UTF8&qid=1434513284&sr=8-1&keywords=apache+solr&pebp=1434513287267&perid=0YRK508J0HJ1N3BAX20E will give you the grounding you need to get the most out of Solr. Best, Erick On Tue, Jun 16, 2015 at 8:27 PM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: The long content is from when I tried to index PDF files. As some PDF files have a lot of words in the content, it will lead to the *UTF8 encoding is longer than the max length 32766 error.* I think the problem is that the content size of the PDF file exceeds 32766 characters? I'm trying to be able to index documents that can be of any size (even those with very large contents), and build the suggester from there. Also, when I do a search, it shouldn't be returning whole fields, but just a portion of the sentence. Regards, Edwin On 16 June 2015 at 23:02, Erick Erickson erickerick...@gmail.com wrote: The suggesters are built to return whole fields. You _might_ be able to add multiple fragments to a multiValued entry and get fragments, I haven't tried that though and I suspect that actually you'd get the same thing. This is an XY problem IMO.
Please describe exactly what you're trying to accomplish, with examples rather than continue to pursue this path. It sounds like you want spellcheck or similar. The _point_ behind the suggesters is that they handle multiple-word suggestions by returning he whole field. So putting long text fields into them is not going to work. Best, Erick On Tue, Jun 16, 2015 at 1:46 AM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: in line : 2015-06-16 4:43 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: Thanks Benedetti, I've change to the AnalyzingInfixLookup approach, and it is able to start searching from the middle of the field. However, is it possible to make the suggester to show only part of the content of the field (like 2 or 3 fields after), instead of the entire content/sentence, which can be quite long? I assume you use fields in the place of tokens. The answer is yes, I already said that in my previous mail, I invite you to read carefully the answers and the documentation linked ! Related the excessive dimensions of tokens. This is weird, what are you trying to autocomplete ? I really doubt would be useful for a user to see super long auto completed terms. Cheers Regards, Edwin On 15 June 2015 at 17:33, Alessandro Benedetti benedetti.ale...@gmail.com wrote: ehehe Edwin, I think you should read again the document I linked time ago : http://lucidworks.com/blog/solr-suggester/ The suggester you used is not meant to provide infix suggestions. The fuzzy suggester is working on a fuzzy basis , with the *starting* terms of a field content. What you are looking for is actually one of the Infix Suggesters. For example the AnalyzingInfixLookup approach. 
When working with Suggesters is important first to make a distinction : 1) Returning the full content of the field ( analysisInfix or Fuzzy) 2) Returning token(s) ( Free Text Suggester) Then the second difference is : 1) Infix suggestions ( from the middle of the field content) 2) Classic suggester ( from the beginning of the field content) Clarified that, will be quite simple to work with suggesters. Cheers 2015-06-15 9:28 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: I've indexed a rich-text documents with the following content: This is a testing rich text documents to test the uploading of files to Solr When I tried to use the suggestion, it return me the entire field in the content once I enter suggest?q=t. However, when I tried to search for q='rich', I don't get any results returned. This is my current configuration for the suggester: searchComponent name=suggest
Joins with comma separated values
Hi, We have some master data and some content data. Master data would be things like userid, name, email id etc. Our content data for example is a blog. The blog has certain fields which are comma separated ids that point to the master data. E.g. UserIDs of people who have commented on a particular blog can be found in the blog table in a comma separated field of userids. Similarly userids of people who have liked the blog can be found in a comma separated field of userids. How do I join this comma separated list of userids with the master data so that I can get the other details of the user such as name, email, picture etc? Thanks, Advait
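One common approach, assuming blogs and users can live in the same index (an assumption on my part, not stated in the question): split the comma-separated ids into a multiValued field at index time, then use Solr's {!join} query parser to relate blog documents to user documents. A sketch with invented field names:

```xml
<!-- Hypothetical schema sketch: split "12,34,56" into individual values
     (in your indexing code or an update processor) and store them here. -->
<field name="commenterIds" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- Then "users who commented on blog 42" becomes a join from the blog
     docs onto the user docs, e.g.:
     q={!join from=commenterIds to=userid}id:42&fl=userid,name,email -->
```

Solr joins return documents from only one side of the join, so to display both blog and user details you would issue the join query for the user details and merge client-side.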
Re: How to create concatenated token
Hi Erick, Thank you so much; it will be helpful for me to learn how to save the state of a token. I had no idea how to save the state of previous tokens, and because of this it was difficult to generate a concatenated token at the end. So is there anything I should read to learn more about it? With Regards Aman Tandon On Wed, Jun 17, 2015 at 9:20 AM, Erick Erickson erickerick...@gmail.com wrote: I really question the premise, but have a look at: https://issues.apache.org/jira/browse/SOLR-7193 Note that this is not committed and I haven't reviewed it so I don't have anything to say about that. And you'd have to implement it as a custom Filter. Best, Erick On Tue, Jun 16, 2015 at 5:55 PM, Aman Tandon amantandon...@gmail.com wrote: Hi, Any guesses how I could achieve this behaviour? With Regards Aman Tandon On Tue, Jun 16, 2015 at 8:15 PM, Aman Tandon amantandon...@gmail.com wrote: e.g. Intent for solr training: fq=id: 234, 456, 545 title(solr training) typo error e.g. Intent for solr training: fq=id:(234 456 545) title:(solr training) With Regards Aman Tandon On Tue, Jun 16, 2015 at 8:13 PM, Aman Tandon amantandon...@gmail.com wrote: We have some business logic to search the user query against user intent or to find the exact matching products. e.g. Intent for solr training: fq=id: 234, 456, 545 title(solr training) As we can see it is a phrase query, so it will take more time than a single stemmed-token query. There are also 5-7 word phrase queries. So we want to reduce the search time by implementing this feature. With Regards Aman Tandon On Tue, Jun 16, 2015 at 6:42 PM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: Can I ask you why you need to concatenate the tokens? Maybe we can find a better solution to concat all the tokens into one single big token. I find it difficult to understand the reasons behind tokenising, token filtering and then un-tokenizing again :) It would be great if you could explain a little bit better what you would like to do !
Cheers 2015-06-16 13:26 GMT+01:00 Aman Tandon amantandon...@gmail.com: Hi, I have a requirement to create a concatenated token of all the tokens created from the last item of my analyzer chain. *Suppose my analyzer chain is :* <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.WordDelimiterFilterFactory" catenateAll="1" splitOnNumerics="1" preserveOriginal="1"/> <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/> <filter class="solr.PorterStemmerFilterFactory"/> I want to create a concatenated-token plugin to add a concatenated token along with the last token. e.g. Solr training *Porter:-* solr train Position 1 2 *Concatenated :-* solr train solrtrain Position 1 2 Please help me out. How do I create a custom filter for this requirement? With Regards Aman Tandon -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: How to create concatenated token
I really question the premise, but have a look at: https://issues.apache.org/jira/browse/SOLR-7193 Note that this is not committed and I haven't reviewed it so I don't have anything to say about that. And you'd have to implement it as a custom Filter. Best, Erick On Tue, Jun 16, 2015 at 5:55 PM, Aman Tandon amantandon...@gmail.com wrote: Hi, Any guesses how I could achieve this behaviour? With Regards Aman Tandon On Tue, Jun 16, 2015 at 8:15 PM, Aman Tandon amantandon...@gmail.com wrote: e.g. Intent for solr training: fq=id: 234, 456, 545 title(solr training) typo error e.g. Intent for solr training: fq=id:(234 456 545) title:(solr training) With Regards Aman Tandon On Tue, Jun 16, 2015 at 8:13 PM, Aman Tandon amantandon...@gmail.com wrote: We have some business logic to search the user query against user intent or to find the exact matching products. e.g. Intent for solr training: fq=id: 234, 456, 545 title(solr training) As we can see it is a phrase query, so it will take more time than a single stemmed-token query. There are also 5-7 word phrase queries. So we want to reduce the search time by implementing this feature. With Regards Aman Tandon On Tue, Jun 16, 2015 at 6:42 PM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: Can I ask you why you need to concatenate the tokens? Maybe we can find a better solution to concat all the tokens into one single big token. I find it difficult to understand the reasons behind tokenising, token filtering and then un-tokenizing again :) It would be great if you could explain a little bit better what you would like to do ! Cheers 2015-06-16 13:26 GMT+01:00 Aman Tandon amantandon...@gmail.com: Hi, I have a requirement to create a concatenated token of all the tokens created from the last item of my analyzer chain.
*Suppose my analyzer chain is :* <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.WordDelimiterFilterFactory" catenateAll="1" splitOnNumerics="1" preserveOriginal="1"/> <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/> <filter class="solr.PorterStemmerFilterFactory"/> I want to create a concatenated-token plugin to add a concatenated token along with the last token. e.g. Solr training *Porter:-* solr train Position 1 2 *Concatenated :-* solr train solrtrain Position 1 2 Please help me out. How do I create a custom filter for this requirement? With Regards Aman Tandon -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
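The heart of such a custom filter is buffering: pass every incoming token through unchanged, and when the stream is exhausted emit one extra token that concatenates everything seen (like catenateAll, but for the whole stream). The sketch below shows only that buffering logic in plain Java, with strings standing in for tokens; a real implementation would subclass Lucene's TokenFilter and use the attribute/state APIs instead, and the class and method names here are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: the buffering a concatenating token filter needs.
// Strings stand in for Lucene tokens; names are made up for this example.
public class ConcatSketch {

    // Pass each token through in order, then append one extra token that is
    // the concatenation of all of them, emitted once at end-of-stream.
    static List<String> withConcatenated(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens);   // original tokens, unchanged
        StringBuilder all = new StringBuilder();
        for (String t : tokens) {
            all.append(t);                            // buffer while consuming
        }
        if (all.length() > 0) {
            out.add(all.toString());                  // the extra "solrtrain"-style token
        }
        return out;
    }

    public static void main(String[] args) {
        // Mirrors the example in the thread: "solr train" -> "solr train solrtrain"
        System.out.println(withConcatenated(Arrays.asList("solr", "train")));
        // prints [solr, train, solrtrain]
    }
}
```

In a real TokenFilter the same idea is expressed by consuming input.incrementToken() until it returns false, appending each CharTermAttribute to a buffer, and then restoring a single extra token carrying the buffered text.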
Re: Raw lucene query for a given solr query
: You can get raw query (and other debug information) with debug=true : parameter.

more specifically -- if you are writing a custom SearchComponent and want to access the underlying Query object produced by the parsers that SolrIndexSearcher has executed, you can do so the same way the debug component does... https://svn.apache.org/viewvc/lucene/dev/branches/branch_5x/solr/core/src/java/org/apache/solr/handler/component/DebugComponent.java?view=markup#l98

: Hi, : : We have a few custom solrcloud components that act as value sources inside : solrcloud for boosting items in the index. I want to get the final raw : lucene query used by solr for querying the index (for debugging purposes). : : Is it possible to get that information? : : Kindly advise : : Thanks, : Nitin

-Hoss http://www.lucidworks.com/
Re: solr/lucene index merge and optimize performance improvement
I think your advice about future incremental updates is very useful. I will keep an eye on that. Actually, I am currently interested in how to boost the merging/optimizing performance of a single Solr instance. Parallelism at the MapReduce level does not help merging/optimizing much, unless Solr/Lucene internally has a distributed indexing mechanism such as threading. Specifically, I am talking about these parameters:

  // ((TieredMergePolicy) mergePolicy).setMaxMergeAtOnceExplicit(1);
  // ((TieredMergePolicy) mergePolicy).setMaxMergeAtOnce(1);
  // ((TieredMergePolicy) mergePolicy).setSegmentsPerTier(1);

https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/map-reduce/src/java/org/apache/solr/hadoop/TreeMergeOutputFormat.java#L119-121

Do you know how they affect merging/optimizing performance, or do you know of any docs about them? I tried uncommenting them and the performance improved, so I am considering tuning the parameters further. As you mentioned, IndexWriter.forceMerge does exist, at line 153 of https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/map-reduce/src/java/org/apache/solr/hadoop/TreeMergeOutputFormat.java#L153 I am very grateful for your advice. Thanks a lot.

On Mon, Jun 15, 2015 at 10:39 PM, Erick Erickson erickerick...@gmail.com wrote: Ah, OK. For very slowly changing indexes optimize can make sense. Do note, though, that if you incrementally index after the full build, and especially if you update documents, you're laying a trap for the future. Let's say you optimize down to a single segment. The default TieredMergePolicy tries to merge similar-size segments. But now you have one huge segment, and docs will be marked as deleted from that segment but not cleaned up until that segment is merged, which won't happen for a long time since it is so much bigger (I'm assuming) than the segments the incremental indexing will create.
Now, the percentage of deleted documents weighs quite heavily in the decision of which segments to merge, so it might not matter. It's just something to be aware of. Surely benchmarking is in order, as you indicated. The Lucene-level IndexWriter.forceMerge method seems to be what you need, though if you're working over HDFS I'm in unfamiliar territory. But the constructors of IndexWriter take a Directory, and HdfsDirectory extends BaseDirectory which extends Directory, so if you can set up an HdfsDirectory it should just work. I haven't personally tried it, though. I saw something recently where optimization helped considerably in a sharded situation where the rows parameter was 400 (10 shards). My belief is that what was really happening was that the first pass of a distributed search was being slowed by disk seeks across multiple smaller segments. I'm waiting for SOLR-6810, which should impact that problem. Don't know if it applies to your situation or not, though. HTH, Erick

On Mon, Jun 15, 2015 at 8:30 PM, Shenghua(Daniel) Wan wansheng...@gmail.com wrote: Hi, Erick, First, thanks for sharing the ideas. I am giving more context here accordingly. 1. Why optimize? I have done some experiments to compare the query response time, and there is some difference. In addition, the searcher will be customer-facing. I think any performance boost will be worthwhile, unless indexing becomes more frequent. However, more benchmarking will be necessary to quantify the margin. 2. Why embedded solr server? I adopted the idea from Mark Miller's map-reduce indexing and built on top of its original contribution to Solr. It launches an embedded solr server at the end of the reducer stage. Basically a solr instance is brought up and fed with documents, so the index is generated at each reducer. Then the indexes are merged, and optimized if desired. Thanks.
On Mon, Jun 15, 2015 at 5:06 PM, Erick Erickson erickerick...@gmail.com wrote: The first question is why you're optimizing at all. It's not recommended unless you can demonstrate that an optimized index gives you enough of a performance boost to be worth the effort. And why are you using embedded solr server? That's kind of unusual, so I wonder if you've gone down a wrong path somewhere. In other words, this feels like an XY problem: you're asking about a specific task without explaining the problem you're trying to solve, and there may be better alternatives. Best, Erick

On Mon, Jun 15, 2015 at 4:56 PM, Shenghua(Daniel) Wan wansheng...@gmail.com wrote: Hi, Do you have any suggestions for improving the performance of merging and optimizing an index? I have been using embedded solr server to merge and optimize the index, and I am looking for the right parameters to tune. My use case has about 300 fields plus 250 copyFields, and moderate doc size (about 65K per doc on average). https://wiki.apache.org/solr/MergingSolrIndexes does not help.
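For anyone who wants to experiment with the same knobs without patching TreeMergeOutputFormat: the TieredMergePolicy settings discussed above can also be set declaratively in solrconfig.xml. A sketch in the Solr 4.x/5.x config style; the values are placeholders to benchmark, not recommendations:

```xml
<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">10</int>
    <int name="maxMergeAtOnceExplicit">30</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicy>
  <!-- concurrent merges can overlap I/O and CPU during normal indexing -->
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
</indexConfig>
```

Note that, as observed earlier in the thread, a forced merge down to one segment remains effectively a single-threaded operation whichever values are chosen; these settings mainly shape how much merging happens along the way.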
Do we need to add docValues=true to _version_ field in schema.xml?
For the _version_ field in schema.xml, do we need to set docValues=true?

  <field name="_version_" type="long" indexed="true" stored="true"/>

We noticed there are FieldCache entries for _version_ in the Solr stats: http://lucene.472066.n3.nabble.com/file/n4212123/IMAGE%245A8381797719FDA9.jpg -- View this message in context: http://lucene.472066.n3.nabble.com/Do-we-need-to-add-docValues-true-to-version-field-in-schema-xml-tp4212123.html Sent from the Solr - User mailing list archive at Nabble.com.
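If the intent is to keep _version_ lookups out of the FieldCache, the docValues variant of that definition would look like the following (a sketch; changing it requires reindexing, and whether your particular Solr version uses docValues for _version_ lookups should be verified against its documentation):

```xml
<field name="_version_" type="long" indexed="true" stored="true" docValues="true"/>
```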
Re: Raw lucene query for a given solr query
Hi, You can get the raw query (and other debug information) with the debug=true parameter. Regards, Tomoko

2015-06-16 8:10 GMT+09:00 KNitin nitin.t...@gmail.com: Hi, We have a few custom solrcloud components that act as value sources inside solrcloud for boosting items in the index. I want to get the final raw lucene query used by solr for querying the index (for debugging purposes). Is it possible to get that information? Kindly advise Thanks, Nitin
Re: Solr's suggester results
Replies inline: 2015-06-16 4:43 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: Thanks Benedetti, I've changed to the AnalyzingInfixLookup approach, and it is able to start searching from the middle of the field. However, is it possible to make the suggester show only part of the content of the field (like 2 or 3 fields after), instead of the entire content/sentence, which can be quite long?

I assume you are saying "fields" in place of "tokens". The answer is yes; I already said that in my previous mail, and I invite you to read the answers and the linked documentation carefully! Regarding the excessive size of the tokens: this is weird. What are you trying to autocomplete? I really doubt it would be useful for a user to see super-long autocompleted terms. Cheers

Regards, Edwin On 15 June 2015 at 17:33, Alessandro Benedetti benedetti.ale...@gmail.com wrote: ehehe Edwin, I think you should read again the document I linked some time ago: http://lucidworks.com/blog/solr-suggester/ The suggester you used is not meant to provide infix suggestions. The fuzzy suggester works on a fuzzy basis with the *starting* terms of the field content. What you are looking for is actually one of the infix suggesters, for example the AnalyzingInfixLookup approach. When working with suggesters it is important first to make a distinction: 1) returning the full content of the field (AnalyzingInfix or Fuzzy), or 2) returning token(s) (FreeTextSuggester). The second distinction is: 1) infix suggestions (from the middle of the field content), or 2) classic suggestions (from the beginning of the field content). Once that is clarified, it is quite simple to work with suggesters. Cheers

2015-06-15 9:28 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: I've indexed a rich-text document with the following content: "This is a testing rich text documents to test the uploading of files to Solr" When I tried to use the suggester, it returned me the entire field content once I entered suggest?q=t.
However, when I tried to search for q='rich', I don't get any results returned. This is my current configuration for the suggester:

  <searchComponent name="suggest" class="solr.SuggestComponent">
    <lst name="suggester">
      <str name="name">mySuggester</str>
      <str name="lookupImpl">FuzzyLookupFactory</str>
      <str name="dictionaryImpl">DocumentDictionaryFactory</str>
      <str name="field">Suggestion</str>
      <str name="suggestAnalyzerFieldType">suggestType</str>
      <str name="buildOnStartup">true</str>
      <str name="buildOnCommit">false</str>
    </lst>
  </searchComponent>

  <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <str name="wt">json</str>
      <str name="indent">true</str>
      <str name="suggest">true</str>
      <str name="suggest.count">10</str>
      <str name="suggest.dictionary">mySuggester</str>
    </lst>
    <arr name="components">
      <str>suggest</str>
    </arr>
  </requestHandler>

Is it possible to make the suggester return something even from the middle of the sentence, and also not return the entire sentence if the sentence is long? Perhaps it should just suggest the next 2 or 3 fields, and return more fields as the user types. For example: when the user types 'this', it should return 'This is a testing'; when the user types 'this is a testing', it should return 'This is a testing rich text documents'. Regards, Edwin -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
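For reference, the change Edwin describes (from the fuzzy lookup to an infix lookup) amounts to swapping the lookupImpl. A sketch reusing the names from the thread, with the other values assumed unchanged:

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">Suggestion</str>
    <str name="suggestAnalyzerFieldType">suggestType</str>
    <str name="buildOnStartup">true</str>
  </lst>
</searchComponent>
```

As discussed in the thread, infix lookups return the whole stored field value; trimming a suggestion to the next few words generally has to happen client-side, or by indexing shorter shingled fragments into the suggest field.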
Re: Phrase query get converted to SpanNear with slop 1 instead of 0
Ok. Thank you, Chris. It is a custom query parser; I will check where my query parser injects the slop of 1.

On Tue, Jun 16, 2015 at 3:26 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : I encounter this peculiar case with solr 4.10.2 where the parsed query : doesn't seem to be logical. : : PHRASE23(reduce workforce) == : SpanNearQuery(spanNear([spanNear([Contents:reduce, : Contents:workforce], 1, true)], 23, true)) 1) that does not appear to be the syntax of any parser that comes with Solr (that I know of), so it's possible that whatever custom parser you are using has a bug in it. 2) IIRC, with span queries (which, unlike PhraseQueries, explicitly support both in-order and out-of-order nearness) a slop of 0 is going to require that the 2 spans overlap and occupy the exact same position -- a slop of 1 means that they differ by a single position. -Hoss http://www.lucidworks.com/ -- Ariya
Re: Facet on same field in different ways
Hi Phanindra, Have you tried this syntax?

  facet=true&facet.field={!ex=st key=terms facet.limit=5 facet.prefix=ap}query_terms&facet.field={!key=terms2 facet.limit=1}query_terms&rows=0&facet.mincount=1

This seems to be the proper syntax; I found it here: https://issues.apache.org/jira/browse/SOLR-4717 Does this solve your problem? Cheers

2015-06-16 0:05 GMT+01:00 Phanindra R phani...@gmail.com: Hi guys, Is there a way to facet on the same field in different ways? For example, using a different facet.prefix. Here are the details:

  facet.field={!key=myKey}myField&facet.prefix=p            == works
  facet.field={!key=myKey}myField&f.myField.facet.prefix=p  == works
  facet.field={!key=myKey}myField&f.myKey.facet.prefix=p    == doesn't work (ref: SOLR-1351)

In addition, when I try f.myKey.facet.range.gap=2.0, it doesn't recognize it and throws the error: Missing required parameter: f.myField.facet.range.gap (or default: facet.range.gap) I'm using Solr 4.10. Ref: https://issues.apache.org/jira/browse/SOLR-1351 Thanks -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Phrase query get converted to SpanNear with slop 1 instead of 0
Hi Ariya, I think Hossman pointed out that the slop of 1 is fine in your use case :) Assuming, of course, that span queries are what you were expecting! Cheers

2015-06-16 10:13 GMT+01:00 ariya bala ariya...@gmail.com: Ok. Thank you, Chris. It is a custom query parser; I will check where my query parser injects the slop of 1. On Tue, Jun 16, 2015 at 3:26 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : I encounter this peculiar case with solr 4.10.2 where the parsed query : doesn't seem to be logical. : : PHRASE23(reduce workforce) == : SpanNearQuery(spanNear([spanNear([Contents:reduce, : Contents:workforce], 1, true)], 23, true)) 1) that does not appear to be the syntax of any parser that comes with Solr (that I know of), so it's possible that whatever custom parser you are using has a bug in it. 2) IIRC, with span queries (which, unlike PhraseQueries, explicitly support both in-order and out-of-order nearness) a slop of 0 is going to require that the 2 spans overlap and occupy the exact same position -- a slop of 1 means that they differ by a single position. -Hoss http://www.lucidworks.com/ -- Ariya -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
phrase matches returning near matches
Hiya, I've been looking for documentation that would explain (or let me modify) why 'near neighbours' are returned from a phrase search. If I search for: "manage change" I get back a document that contains "this will help in your management of [lots more words]... changes." It's relevant, but I'd like to understand why solr is returning it. Is it a combination of fuzzy/slop? The distance between the two variations of the two words in the document is quite large. thanks, Alistair -- mov eax,1 mov ebx,0 int 80h
Re: phrase matches returning near matches
Can you show us how the query is parsed? You haven't told us anything about the query parser you are using. Enabling debugQuery=true will show how the query is parsed, and this will be quite useful for us. Cheers

2015-06-16 11:22 GMT+01:00 Alistair Young alistair.yo...@uhi.ac.uk: Hiya, I've been looking for documentation that would explain (or let me modify) why 'near neighbours' are returned from a phrase search. If I search for: "manage change" I get back a document that contains "this will help in your management of [lots more words]... changes." It's relevant, but I'd like to understand why solr is returning it. Is it a combination of fuzzy/slop? The distance between the two variations of the two words in the document is quite large. thanks, Alistair -- mov eax,1 mov ebx,0 int 80h -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
What contribute to a Solr core's FieldCache entry_count?
For the fieldCache, what determines entries_count? Does each search request containing a sort on a non-docValues field contribute one entry to entries_count? For example, will search A ( q=owner:1&sort=maildate asc ) and search B ( q=owner:2&sort=maildate asc ) contribute 2 field cache entries? I have a collection containing only one core, and there is only one doc within it, so why are there so many Lucene FieldCache entries? http://lucene.472066.n3.nabble.com/file/n4212148/%244FA9F550C60D3BA2.jpg http://lucene.472066.n3.nabble.com/file/n4212148/Untitled.png -- View this message in context: http://lucene.472066.n3.nabble.com/What-contribute-to-a-Solr-core-s-FieldCache-entry-count-tp4212148.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: phrase matches returning near matches
it's a useful behaviour; I'd just like to understand where it's deciding the document is relevant. debug output is:

  <lst name="debug">
    <str name="rawquerystring">dc.description:"manage change"</str>
    <str name="querystring">dc.description:"manage change"</str>
    <str name="parsedquery">PhraseQuery(dc.description:"manag chang")</str>
    <str name="parsedquery_toString">dc.description:"manag chang"</str>
    <lst name="explain">
      <str name="tst:test">
        1.2008798 = (MATCH) weight(dc.description:"manag chang" in 221) [DefaultSimilarity], result of:
          1.2008798 = fieldWeight in 221, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = phraseFreq=1.0
            9.6070385 = idf(), sum of:
              4.0365543 = idf(docFreq=101, maxDocs=2125)
              5.5704846 = idf(docFreq=21, maxDocs=2125)
            0.125 = fieldNorm(doc=221)
      </str>
    </lst>
    <str name="QParser">LuceneQParser</str>
    <lst name="timing">
      <double name="time">41.0</double>
      <lst name="prepare">
        <double name="time">3.0</double>
        <lst name="query"><double name="time">0.0</double></lst>
        <lst name="facet"><double name="time">0.0</double></lst>
        <lst name="mlt"><double name="time">0.0</double></lst>
        <lst name="highlight"><double name="time">0.0</double></lst>
        <lst name="stats"><double name="time">0.0</double></lst>
        <lst name="debug"><double name="time">0.0</double></lst>
      </lst>
      <lst name="process">
        <double name="time">35.0</double>
        <lst name="query"><double name="time">0.0</double></lst>
        <lst name="facet"><double name="time">0.0</double></lst>
        <lst name="mlt"><double name="time">0.0</double></lst>
        <lst name="highlight"><double name="time">0.0</double></lst>
        <lst name="stats"><double name="time">0.0</double></lst>
        <lst name="debug"><double name="time">35.0</double></lst>
      </lst>
    </lst>
  </lst>

thanks, Alistair -- mov eax,1 mov ebx,0 int 80h

On 16/06/2015 11:26, Alessandro Benedetti benedetti.ale...@gmail.com wrote: Can you show us how the query is parsed? You haven't told us anything about the query parser you are using. Enabling debugQuery=true will show how the query is parsed, and this will be quite useful for us. Cheers

2015-06-16 11:22 GMT+01:00 Alistair Young alistair.yo...@uhi.ac.uk: Hiya, I've been looking for documentation that would explain (or let me modify) why 'near neighbours' are returned from a phrase search. If I search for: "manage change" I get back a document that contains "this will help in your management of [lots more words]... changes." It's relevant, but I'd like to understand why solr is returning it. Is it a combination of fuzzy/slop? The distance between the two variations of the two words in the document is quite large. thanks, Alistair -- mov eax,1 mov ebx,0 int 80h -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: phrase matches returning near matches
According to your debug output you are using the default Lucene query parser. This surprises me, as I would expect that query to require a match with distance 0 between the 2 terms. Are you sure nothing else in that field matches the phrase query? From the documentation: Lucene supports finding words that are within a specific distance of each other. To do a proximity search, use the tilde, ~, symbol at the end of a phrase. For example, to search for "apache" and "jakarta" within 10 words of each other in a document, use the search: "jakarta apache"~10 Cheers

2015-06-16 11:33 GMT+01:00 Alistair Young alistair.yo...@uhi.ac.uk: [debug output and earlier messages quoted in full in the previous message] thanks, Alistair -- mov eax,1 mov ebx,0 int 80h -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
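A side note on the stemming at play in this thread: the parsed query is a PhraseQuery on the stems "manag chang", so any adjacent pair of surface forms that stem to those terms (for example "managing changes") would match the phrase, not just the literal words "manage change". A toy illustration of that collapse, using a hand-written lookup table in place of the real Porter stemmer (which lives inside Lucene, not here):

```python
# Hypothetical stem table; the real Porter algorithm derives these
# reductions from general suffix-stripping rules.
STEMS = {
    "manage": "manag", "management": "manag", "managing": "manag",
    "change": "chang", "changes": "chang", "changing": "chang",
}

def stem_phrase(text):
    """Stem each whitespace token the way an index-time analyzer would."""
    return [STEMS.get(token, token) for token in text.lower().split()]

# The query phrase and a different surface form produce identical stems,
# so a PhraseQuery over stems matches both.
print(stem_phrase("manage change"))     # ['manag', 'chang']
print(stem_phrase("managing changes"))  # ['manag', 'chang']
```

Whether that explains the specific document above still depends on the term positions: phraseFreq=1.0 in the explain output indicates the two stems do occur adjacently somewhere in that field.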