RE: Solr Phonetic Search Highlight issue in search results
Thanks a lot Erick for trying this out. Will wait for a reply from your end.

Thanks & Regards,
Soumya.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 01 April 2013 05:46 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Phonetic Search Highlight issue in search results

Good question, you're causing me to think... about code I know very little about <G>.

So rather than spouting off, I tried it and... it works fine for me, either with or without using fast vector highlighter on, admittedly, a very simple test.

So I think I'd try peeling off all the extra stuff you've put into your configs (sorry, I don't have time right now to try to reproduce) and get the very simple case working, then build the rest back up and see where the problem begins.

Sorry for the mis-direction!

Erick
RE: Solr Phonetic Search Highlight issue in search results
Hi Erick,

Thanks for the reply. But help me understand this: if Solr is able to isolate the two documents which contain the term fact, the phonetic equivalent of the search term fakt, then why is it unable to highlight the terms using the same logic it uses to search the documents?

Also, it correctly highlights the results in other searches which are also approximate rather than exact, e.g. Fuzzy or Synonym search. In these cases too, the highlighted terms in the search results are far from the actual search term, but they are still highlighted correctly.

Maybe I am getting it completely wrong, but it looks like there is something wrong with my implementation.

Thanks & Regards,
Soumya.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 27 March 2013 06:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Phonetic Search Highlight issue in search results

How would you expect it to highlight successfully? The term is fakt, there's nothing built in (and, indeed, couldn't be) to un-phoneticize it into fact and apply that to the Content field. The whole point of phonetic processing is to do a lossy translation from the word into some variant, losing precision all the way.

So this behavior is unsurprising...

Best
Erick
Solr Phonetic Search Highlight issue in search results
When we are issuing a query with Phonetic Search, it is returning the correct documents but not returning the highlights. When we use Stemming or Synonym searches we are getting the proper highlights.

For example, when we execute a phonetic query for the term fakt (ContentSearchPhonetic:fakt) in the Solr Admin interface, it returns two documents containing the term fact (phonetic token equivalent), but the list of highlights is empty as shown in the response below.

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">16</int>
        <lst name="params">
          <str name="q">ContentSearchPhonetic:fakt</str>
          <str name="wt">xml</str>
        </lst>
      </lst>
      <result name="response" numFound="2" start="0">
        <doc>
          <long name="DocId">1</long>
          <str name="DocTitle">Doc 1</str>
          <str name="Content">Anyway, this game was excellent and was well worth the time. The graphics are truly amazing and the sound track was pretty pleasant also. The preacher was in fact a thief.</str>
          <long name="_version_">1430480998833848320</long>
        </doc>
        <doc>
          <long name="DocId">2</long>
          <str name="DocTitle">Doc 2</str>
          <str name="Content">stunning. The preacher was in fact an excellent thief who had stolen the original manuscript of Hamlet from an exhibit on the Riviera, where he also acquired his remarkable and tan.</str>
          <long name="_version_">1430480998841188352</long>
        </doc>
      </result>
      <lst name="highlighting">
        <lst name="1"/>
        <lst name="2"/>
      </lst>
    </response>

Relevant section of Solr schema:

    <field name="DocId" type="long" indexed="true" stored="true" required="true"/>
    <field name="DocTitle" type="string" indexed="false" stored="true" required="true"/>
    <field name="Content" type="text_general" indexed="false" stored="true" required="true"/>
    <field name="ContentSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
    <field name="ContentSearchStemming" type="text_stem" indexed="true" stored="false" multiValued="true"/>
    <field name="ContentSearchPhonetic" type="text_phonetic" indexed="true" stored="false" multiValued="true"/>
    <field name="ContentSearchSynonym" type="text_synonym" indexed="true" stored="false" multiValued="true"/>

    <uniqueKey>DocId</uniqueKey>

    <copyField source="Content" dest="ContentSearch"/>
    <copyField source="Content" dest="ContentSearchStemming"/>
    <copyField source="Content" dest="ContentSearchPhonetic"/>
    <copyField source="Content" dest="ContentSearchSynonym"/>

    <fieldType name="text_stem" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SnowballPorterFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_phonetic" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_synonym" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      </analyzer>
    </fieldType>

Relevant section of Solr config:

    <requestHandler name="/select" class="solr.SearchHandler">
      <!-- default values for query parameters can be specified, these
           will be overridden by parameters in the request -->
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <int name="rows">100</int>
        <str name="df">ContentSearch</str>
        <bool name="hl">true</bool>
        <str name="hl.fl">Content</str>
        <str name="f.Content.hl.fragsize">150</str>
        <str name="f.Content.hl.snippets">40</str>
      </lst>
    </requestHandler>

    <searchComponent class="solr.HighlightComponent" name="highlight">
      <highlighting>
        <!-- Configure the standard fragmenter -->
        <!-- This could most likely be commented out in the default case -->
        <fragmenter name="gap" default="true" class="solr.highlight.GapFragmenter">
          <lst name="defaults">
            <int name="hl.fragsize">100</int>
          </lst>
        </fragmenter>
        <!-- A regular-expression-based fragmenter (for sentence extraction) -->
        <fragmenter name="regex" class="solr.highlight.RegexFragmenter">
          <lst name="defaults">
            <!-- slightly smaller fragsizes work better because of slop -->
            <int name="hl.fragsize">70</int>
            <!-- allow 50% slop on fragment sizes -->
            <float name="hl.regex.slop">0.5</float>
            <!-- a basic sentence pattern -->
            <str name="hl.regex.pattern">[-\w ,/\n\&quot;&apos;]{20,200}</str>
          </lst>
        </fragmenter>

Has anyone experienced this kind of behaviour before? Need some direction for troubleshooting.

Soumya.
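One possible reading of the empty highlight list, offered as an assumption rather than something confirmed in this thread: the request highlights the Content field (hl.fl=Content), whose type is text_general, while the query term lives in the phonetic field. If the highlighter re-analyzes the stored text with a non-phonetic analyzer, it sees the token fact and has nothing to compare with the phonetic form of fakt. The toy Python sketch below illustrates that idea only. It is not Solr code, it uses a small Soundex implementation as a stand-in for DoubleMetaphone, and the names soundex, tokens and highlight are hypothetical.

```python
import re

# Soundex digit classes; used here only as a small stand-in for DoubleMetaphone.
_CODES = {c: d for d, letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
}.items() for c in letters}


def soundex(word):
    """Classic four-character Soundex code; '' if the word has no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "hw":  # h and w do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def tokens(text):
    """Whitespace tokenizer that keeps character offsets for each token."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def highlight(text, query_term, analyze, pre="<em>", post="</em>"):
    """Wrap every token whose analyzed form equals the analyzed query term."""
    target = analyze(query_term)
    out, last = [], 0
    for tok, start, end in tokens(text):
        if analyze(tok) == target:
            out += [text[last:start], pre, tok, post]
            last = end
    out.append(text[last:])
    return "".join(out)


text = "The preacher was in fact a thief."
# Non-phonetic analysis (lowercasing only): "fact" never equals "fakt".
plain = highlight(text, "fakt", str.lower)
# Phonetic analysis on both sides: both words reduce to the same code,
# and the token offsets point back into the original text.
phonetic = highlight(text, "fakt", soundex)
print(plain)
print(phonetic)
```

In Solr terms this would suggest trying the highlight on a stored field that is analyzed with the phonetic chain (for instance, storing ContentSearchPhonetic and naming it in hl.fl). That is untested here, though it would be consistent with Erick's report that a very simple setup highlights correctly.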
Advanced Search Option in Solr corresponding to DtSearch options
Hi,

We are replacing the search and indexing module in an application from DtSearch to Solr, using solrnet as the .net Solr client library.

We are relatively new to Solr/Lucene and would need some help/direction to understand the more advanced search options in Solr.

The current application supports the following search options using DtSearch:

1) Word(s) or phrase
2) Exact words or phrases
3) Not these words or phrases
4) One or more of words (A OR B OR C)
5) Proximity of word within n words of another word
6) Numeric range - From - To
7) Options
   . Stemming (search* finds searching or searches)
   . Synonym (search& finds seek or look)
   . Fuzzy within n letters (p%arts finds paris)
   . Phonic homonyms (#Smith also finds Smithe and Smythe)

As an example, this is the search query that gets generated and posted to DtSearch for the use case below:

1. Search Phrase: generic collection
2. Exact Phrase: linq
3. Not these words: sql
4. One or more of these words: ICollection or ArrayList or Hashtable
5. Proximity: csharp within 4 words of language
6. Options:
   a. Stemming
   b. Synonym
   c. Fuzzy within 2 letters
   d. Phonic homonyms

Search Query:

    generic* collection* generic& collection& #generic #collection g%%eneric c%%ollection linq -sql ICollection OR ArrayList OR Hashtable csharp w/4 language

We have been able to do simple searches (single-term search in file content) with highlights in Solr. Now we need to replace these options with Solr/Lucene.

Can anybody provide some directions on what/where we should be looking?

Thanks & Regards,
Soumya.
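As a rough starting point, several of the options listed above have counterparts in the standard Solr/Lucene query syntax: quoted phrases, a leading minus for exclusion, OR groups, a slop value on a quoted phrase for proximity, a trailing ~n for fuzzy matching, and [from TO to] for ranges. Stemming, synonyms and phonetic matching are normally handled by field analyzers instead of query operators. The Python sketch below assembles such a query string. It assumes separately analyzed fields named ContentSearch, ContentSearchStemming, ContentSearchSynonym and ContentSearchPhonetic (names borrowed from the phonetic-highlighting thread, not from this message), treats every search word as required, and does not escape special characters inside words. build_query and its parameters are hypothetical, and the proximity clause only approximates "w/4".

```python
def phrase(text):
    """Quote a phrase, escaping embedded backslashes and double quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_query(words=(), exact=None, exclude=(), any_of=(), near=None,
                ranges=(), stemming=False, synonym=False, phonetic=False,
                fuzzy=0):
    """Assemble a Lucene-syntax query string from dtSearch-style options.

    near   -- (word_a, word_b, n): the two words within roughly n positions
    ranges -- iterable of (field, low, high) numeric ranges
    fuzzy  -- maximum edit distance, 0 to disable
    """
    clauses = []
    for w in words:
        variants = ["ContentSearch:" + w]
        if stemming:
            variants.append("ContentSearchStemming:" + w)
        if synonym:
            variants.append("ContentSearchSynonym:" + w)
        if phonetic:
            variants.append("ContentSearchPhonetic:" + w)
        if fuzzy:
            variants.append("ContentSearch:%s~%d" % (w, fuzzy))
        clauses.append("+(" + " OR ".join(variants) + ")")
    if exact:
        clauses.append("+ContentSearch:" + phrase(exact))
    if any_of:
        clauses.append("+ContentSearch:(" + " OR ".join(any_of) + ")")
    if near:
        a, b, n = near
        clauses.append("+ContentSearch:%s~%d" % (phrase(a + " " + b), n))
    for field, low, high in ranges:
        clauses.append("+%s:[%s TO %s]" % (field, low, high))
    for w in exclude:
        clauses.append("-ContentSearch:" + w)
    return " ".join(clauses)


# The use case from the message above.
q = build_query(
    words=["generic", "collection"],
    exact="linq",
    exclude=["sql"],
    any_of=["ICollection", "ArrayList", "Hashtable"],
    near=("csharp", "language", 4),
    stemming=True, synonym=True, phonetic=True, fuzzy=2,
)
print(q)
```

The resulting string could be passed as the q parameter of a request. Whether each word should be required or optional, and which fields each option maps to, depends on the schema and would need to be adjusted.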
Issue with multiple records in full text search
Hi,

We are trying to use Solr for a text-based search solution in a web application. The documents that are getting indexed are essentially text-based files like *.txt, *.pdf, etc. We are using the Tika extraction plugin to extract the text content from the files and storing it using a text_general type field in the Solr schema file.

Relevant part of the schema file:

    <field name="CaseId" type="long" indexed="true" stored="true" required="true"/>
    <field name="CaseTitle" type="string" indexed="false" stored="true" required="true"/>
    <field name="CaseNumber" type="string" indexed="false" stored="true" required="true"/>
    <field name="MediaType" type="int" indexed="true" stored="true" required="true"/>
    <field name="MediaId" type="string" indexed="true" stored="true" required="true"/>
    <field name="CaptionName" type="string" indexed="false" stored="true" required="true"/>
    <field name="MediaPath" type="string" indexed="false" stored="true" required="true"/>
    <field name="MimeType" type="string" indexed="false" stored="true" required="false"/>
    <field name="DocumentNumber" type="string" indexed="false" stored="true" required="false"/>
    <field name="DeponentFullName" type="string" indexed="false" stored="true" required="false"/>
    <field name="DepositionDate" type="date" indexed="false" stored="true" required="false"/>
    <field name="DocCreatedDate" type="date" indexed="false" stored="true" required="false"/>
    <field name="DocModifiedDate" type="date" indexed="false" stored="true" required="false"/>
    <field name="Content" type="text_general" indexed="false" stored="true" required="true"/>
    <field name="WorkgroupIdList" type="text_general" indexed="true" stored="true" required="true" multiValued="true"/>
    <field name="ContentSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>

    <uniqueKey>MediaId</uniqueKey>

    <copyField source="Content" dest="ContentSearch"/>

We are using a .net based solution and using the solrnet client to communicate with Solr.

The Content field is supposed to store the text content of the file, and the ContentSearch field will be used for executing the search.
While the documents are getting indexed properly, when executing a search we get only the first occurrence of the search term returned for each document.

For example, if we have a.txt and b.pdf indexed, and the search term case occurs in both documents multiple times (a.txt - 7 hits, b.pdf - 10 hits), a search for case against both documents returns two records, which are the first occurrences of the search term in the respective docs, while this should return 17 hits.

We used Luke to inspect the index records but cannot find anything apparently wrong.

Is this something to do with the type (text_general) of the search field, or with the way we are loading the entire content of the file into one index document?

Thanks & Regards,
Soumya.
RE: Issue with multiple records in full text search
Thanks Jack for the explanation.

But let's say my requirement is to return all occurrences of the search term, along with the text snippet around each one, for every document in the search scope. How do we go about achieving that with Solr?

Thanks & Regards,
Soumya.

-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: 29 January 2013 08:00 PM
To: solr-user@lucene.apache.org
Subject: Re: Issue with multiple records in full text search

The number of hits of a term in a Solr document impacts the score, but still only counts as one hit in the numFound count. Solr doesn't track hits for individual term occurrences, except that you could check the term frequency of a specific term in a specific document if you wanted, using a function query - tf(field,term) - which can also be included in the fl field list.

To be clear - Solr has no concept of records, just documents and fields.

-- Jack Krupansky
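One way to approach the follow-up question, sketched under assumptions: combine Jack's tf(field,term) function query in fl (to get the per-document occurrence count) with the highlighter, raising hl.snippets so that more than one fragment per document comes back (its default is 1, which may be why only the first occurrence appears). The Python sketch below only builds the request parameters. The field names come from the schema in the original message, the hits alias and both function names are hypothetical, and the snippet limit of 100 is an arbitrary choice. For long documents, hl.maxAnalyzedChars may also need raising, which is not shown.

```python
from urllib.parse import urlencode


def occurrences_request(term, search_field="ContentSearch",
                        stored_field="Content", max_snippets=100,
                        fragsize=150):
    """Build a query string asking for per-document term counts and snippets."""
    params = [
        ("q", "%s:%s" % (search_field, term)),
        # tf(field,term) returns how often the indexed term occurs in each
        # document; "hits" is a hypothetical alias for that pseudo-field.
        ("fl", "MediaId,CaseTitle,hits:tf(%s,%s)" % (search_field, term)),
        ("hl", "true"),
        ("hl.fl", stored_field),            # highlight the stored text
        ("hl.snippets", str(max_snippets)),  # allow many fragments per doc
        ("hl.fragsize", str(fragsize)),
        ("wt", "json"),
    ]
    return urlencode(params)


def total_occurrences(docs):
    """Sum the per-document counts from an already parsed list of docs."""
    return sum(int(d.get("hits", 0)) for d in docs)


qs = occurrences_request("case")
print(qs)
# With the counts from the example in the original message (7 and 10):
print(total_occurrences([{"hits": 7}, {"hits": 10}]))
```

The string returned by occurrences_request would be appended to the /select URL of the core. numFound stays at the number of matching documents (two in the example), while the summed hits value gives the 17 occurrences. Note that tf expects the term as it is indexed, so a lowercasing analyzer means passing the lowercase form.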